ICLR 2020 Final Accepted Papers (poster-paper) List, Part 4

Source: AINLPer (WeChat official account)
Editor: ShuYini
Proofreading: ShuYini
Date: 2020-02-21

    ICLR 2020 will be held April 26-30 this year at Millennium Hall, Addis Ababa, Ethiopia.

    The acceptance results of ICLR 2020 (Eighth International Conference on Learning Representations) have just been released. This year's outcome: 523 poster papers, 107 spotlight papers, and 48 talks were accepted, 678 papers in total, with 1,907 papers rejected (reject-paper), for an acceptance rate of 26.48%.

    Below is the list of accepted ICLR 2020 poster papers; feel free to use Ctrl+F to search for what you need.

    Follow AINLPer and reply "ICLR2020" to get the complete conference lists as PDFs, four files in total (2020-ICLR-accept-poster.pdf, 2020-ICLR-accept-spotlight.pdf, 2020-ICLR-accept-talk.pdf, 2020-ICLR-reject.pdf).

Kernel of CycleGAN as a principal homogeneous space
Author: Nikita Moriakov, Jonas Adler, Jonas Teuwen
link: https://openreview.net/pdf?id=B1eWOJHKvB
Code: None
Abstract: Unpaired image-to-image translation has attracted significant interest due to the invention of CycleGAN, a method which utilizes a combination of adversarial and cycle consistency losses to avoid the need for paired data. It is known that the CycleGAN problem might admit multiple solutions, and our goal in this paper is to analyze the space of exact solutions and to give perturbation bounds for approximate solutions. We show theoretically that the exact solution space is invariant with respect to automorphisms of the underlying probability spaces, and, furthermore, that the group of automorphisms acts freely and transitively on the space of exact solutions. We examine the case of zero pure CycleGAN loss first in its generality, and, subsequently, expand our analysis to approximate solutions for extended CycleGAN loss where identity loss term is included. In order to demonstrate that these results are applicable, we show that under mild conditions nontrivial smooth automorphisms exist. Furthermore, we provide empirical evidence that neural networks can learn these automorphisms with unexpected and unwanted results. We conclude that finding optimal solutions to the CycleGAN loss does not necessarily lead to the envisioned result in image-to-image translation tasks and that underlying hidden symmetries can render the result useless.
Keyword: Generative models, CycleGAN

Distributionally Robust Neural Networks
Author: Shiori Sagawa*, Pang Wei Koh*, Tatsunori B. Hashimoto, Percy Liang
link: https://openreview.net/pdf?id=ryxGuJrFvS
Code: None
Abstract: Overparameterized neural networks can be highly accurate on average on an i.i.d. test set, yet consistently fail on atypical groups of the data (e.g., by learning spurious correlations that hold on average but not in such groups). Distributionally robust optimization (DRO) allows us to learn models that instead minimize the worst-case training loss over a set of pre-defined groups. However, we find that naively applying group DRO to overparameterized neural networks fails: these models can perfectly fit the training data, and any model with vanishing average training loss also already has vanishing worst-case training loss. Instead, the poor worst-case performance arises from poor generalization on some groups. By coupling group DRO models with increased regularization—stronger-than-typical L2 regularization or early stopping—we achieve substantially higher worst-group accuracies, with 10-40 percentage point improvements on a natural language inference task and two image tasks, while maintaining high average accuracies. Our results suggest that regularization is important for worst-group generalization in the overparameterized regime, even if it is not needed for average generalization. Finally, we introduce a stochastic optimization algorithm for the group DRO setting and provide convergence guarantees for the new algorithm.
Keyword: distributionally robust optimization, deep learning, robustness, generalization, regularization
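
As a rough illustration of the group DRO objective with strong regularization described above, the PyTorch sketch below averages the loss within each group of a batch, optimizes the worst group, and applies heavier-than-usual weight decay. The group labels, model, and hyperparameter values are placeholders, and the paper's stochastic group DRO algorithm (with an online weighting over groups) is not reproduced here.
```python
import torch
import torch.nn.functional as F

def worst_group_loss(logits, labels, group_ids, num_groups):
    """Average the per-example loss within each group, then take the worst group."""
    per_example = F.cross_entropy(logits, labels, reduction="none")
    group_losses = []
    for g in range(num_groups):
        mask = (group_ids == g)
        if mask.any():
            group_losses.append(per_example[mask].mean())
    return torch.stack(group_losses).max()

# Placeholder model and data; stronger-than-typical L2 regularization via weight_decay.
model = torch.nn.Linear(100, 2)
opt = torch.optim.SGD(model.parameters(), lr=1e-3, weight_decay=1.0)

x = torch.randn(32, 100)
y = torch.randint(0, 2, (32,))
groups = torch.randint(0, 4, (32,))  # hypothetical group labels

loss = worst_group_loss(model(x), y, groups, num_groups=4)
opt.zero_grad()
loss.backward()
opt.step()
```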

On Solving Minimax Optimization Locally: A Follow-the-Ridge Approach
Author: Yuanhao Wang*, Guodong Zhang*, Jimmy Ba
link: https://openreview.net/pdf?id=Hkx7_1rKwS
Code: None
Abstract: Many tasks in modern machine learning can be formulated as finding equilibria in sequential games. In particular, two-player zero-sum sequential games, also known as minimax optimization, have received growing interest. It is tempting to apply gradient descent to solve minimax optimization given its popularity and success in supervised learning. However, it has been noted that naive application of gradient descent fails to find some local minimax and can converge to non-local-minimax points. In this paper, we propose Follow-the-Ridge (FR), a novel algorithm that provably converges to and only converges to local minimax. We show theoretically that the algorithm addresses the notorious rotational behaviour of gradient dynamics, and is compatible with preconditioning and positive momentum. Empirically, FR solves toy minimax problems and improves the convergence of GAN training compared to the recent minimax optimization algorithms.
Keyword: minimax optimization, smooth differentiable games, local convergence, generative adversarial networks, optimization
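
The abstract mentions the rotational behaviour of naive gradient dynamics; the tiny NumPy sketch below makes that concrete by running simultaneous gradient descent-ascent on f(x, y) = xy and watching it spiral away from the local minimax at the origin. It illustrates the failure mode Follow-the-Ridge is designed to fix, not the FR update itself.
```python
import numpy as np

# f(x, y) = x * y; grad_x f = y, grad_y f = x.
# Simultaneous gradient descent-ascent spirals away from the local minimax (0, 0).
x, y = 1.0, 1.0
lr = 0.1
for step in range(200):
    gx, gy = y, x                         # gradients of f w.r.t. x and y
    x, y = x - lr * gx, y + lr * gy       # descent on x, ascent on y
    if step % 50 == 0:
        print(step, x, y, np.hypot(x, y))  # distance from (0, 0) keeps growing
```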

A Neural Dirichlet Process Mixture Model for Task-Free Continual Learning
Author: Soochan Lee, Junsoo Ha, Dongsu Zhang, Gunhee Kim
link: https://openreview.net/pdf?id=SJxSOJStPr
Code: https://github.com/soochan-lee/CN-DPM
Abstract: Despite the growing interest in continual learning, most of its contemporary works have been studied in a rather restricted setting where tasks are clearly distinguishable, and task boundaries are known during training. However, if our goal is to develop an algorithm that learns as humans do, this setting is far from realistic, and it is essential to develop a methodology that works in a task-free manner. Meanwhile, among several branches of continual learning, expansion-based methods have the advantage of eliminating catastrophic forgetting by allocating new resources to learn new data. In this work, we propose an expansion-based approach for task-free continual learning. Our model, named Continual Neural Dirichlet Process Mixture (CN-DPM), consists of a set of neural network experts that are in charge of a subset of the data. CN-DPM expands the number of experts in a principled way under the Bayesian nonparametric framework. With extensive experiments, we show that our model successfully performs task-free continual learning for both discriminative and generative tasks such as image classification and image generation.
Keyword: continual learning, task-free, task-agnostic

Hyper-SAGNN: a self-attention based graph neural network for hypergraphs
Author: Ruochi Zhang, Yuesong Zou, Jian Ma
link: https://openreview.net/pdf?id=ryeHuJBtPH
Code: https://drive.google.com/drive/folders/1kIOc4SlAJllUJsrr2OnZ4izIQIw2JexU?usp=sharing
Abstract: Graph representation learning for hypergraphs can be utilized to extract patterns among higher-order interactions that are critically important in many real world problems. Current approaches designed for hypergraphs, however, are unable to handle different types of hypergraphs and are typically not generic for various learning tasks. Indeed, models that can predict variable-sized heterogeneous hyperedges have not been available. Here we develop a new self-attention based graph neural network called Hyper-SAGNN applicable to homogeneous and heterogeneous hypergraphs with variable hyperedge sizes. We perform extensive evaluations on multiple datasets, including four benchmark network datasets and two single-cell Hi-C datasets in genomics. We demonstrate that Hyper-SAGNN significantly outperforms state-of-the-art methods on traditional tasks while also achieving great performance on a new task called outsider identification. We believe that Hyper-SAGNN will be useful for graph representation learning to uncover complex higher-order interactions in different applications.
Keyword: graph neural network, hypergraph, representation learning

Neural Epitome Search for Architecture-Agnostic Network Compression
Author: Daquan Zhou, Xiaojie Jin, Qibin Hou, Kaixin Wang, Jianchao Yang, Jiashi Feng
link: https://openreview.net/pdf?id=HyxjOyrKvr
Code: None
Abstract: Traditional compression methods including network pruning, quantization, low rank factorization and knowledge distillation all assume that network architectures and parameters should be hardwired. In this work, we propose a new perspective on network compression, i.e., network parameters can be disentangled from the architectures. From this viewpoint, we present the Neural Epitome Search (NES), a new neural network compression approach that learns to find compact yet expressive epitomes for weight parameters of a specified network architecture end-to-end. The complete network to compress can be generated from the learned epitome via a novel transformation method that adaptively transforms the epitomes to match shapes of the given architecture. Compared with existing compression methods, NES allows the weight tensors to be independent of the architecture design and hence can achieve a good trade-off between model compression rate and performance given a specific model size constraint. Experiments demonstrate that, on ImageNet, when taking MobileNetV2 as backbone, our approach improves the full-model baseline by 1.47% in top-1 accuracy with 25% MAdd reduction and AutoML for Model Compression (AMC) by 2.5% with nearly the same compression ratio. Moreover, taking EfficientNet-B0 as baseline, our NES yields an improvement of 1.2% with 10% less MAdd. In particular, our method achieves a new state-of-the-art result of 77.5% under mobile settings (<350M MAdd). Code will be made publicly available.
Keyword: Network Compression, Classification, Deep Learning, Weights Sharing

On the Equivalence between Positional Node Embeddings and Structural Graph Representations
Author: Balasubramaniam Srinivasan, Bruno Ribeiro
link: https://openreview.net/pdf?id=SJxzFySKwH
Code: https://github.com/PurdueMINDS/Equivalence
Abstract: This work provides the first unifying theoretical framework for node embeddings and structural graph representations, bridging methods like matrix factorization and graph neural networks. Using invariant theory, we show that the relationship between structural representations and node embeddings is analogous to that of a distribution and its samples. We prove that all tasks that can be performed by node embeddings can also be performed by structural representations and vice-versa. We also show that the concept of transductive and inductive learning is unrelated to node embeddings and graph representations, clearing another source of confusion in the literature. Finally, we introduce new practical guidelines to generating and using node embeddings, which further augments standard operating procedures used today.
Keyword: Graph Neural Networks, Structural Graph Representations, Node Embeddings, Relational Learning, Invariant Theory, Theory, Deep Learning, Representational Power, Graph Isomorphism

Probability Calibration for Knowledge Graph Embedding Models
Author: Pedro Tabacof, Luca Costabello
link: https://openreview.net/pdf?id=S1g8K1BFwS
Code: None
Abstract: Knowledge graph embedding research has overlooked the problem of probability calibration. We show popular embedding models are indeed uncalibrated. That means probability estimates associated to predicted triples are unreliable. We present a novel method to calibrate a model when ground truth negatives are not available, which is the usual case in knowledge graphs. We propose to use Platt scaling and isotonic regression alongside our method. Experiments on three datasets with ground truth negatives show our contribution leads to well calibrated models when compared to the gold standard of using negatives. We get significantly better results than the uncalibrated models from all calibration methods. We show isotonic regression offers the best performance overall, though not without trade-offs. We also show that calibrated models reach state-of-the-art accuracy without the need to define relation-specific decision thresholds.
Keyword: knowledge graph embeddings, probability calibration, calibration, graph representation learning, knowledge graphs

Why Not to Use Zero Imputation? Correcting Sparsity Bias in Training Neural Networks
Author: Joonyoung Yi, Juhyuk Lee, Kwang Joon Kim, Sung Ju Hwang, Eunho Yang
link: https://openreview.net/pdf?id=BylsKkHYvH
Code: https://github.com/JoonyoungYi/sparsity-normalization
Abstract: Handling missing data is one of the most fundamental problems in machine learning. Among many approaches, the simplest and most intuitive way is zero imputation, which treats the value of a missing entry simply as zero. However, many studies have experimentally confirmed that zero imputation results in suboptimal performances in training neural networks. Yet, none of the existing work has explained what brings such performance degradations. In this paper, we introduce the variable sparsity problem (VSP), which describes a phenomenon where the output of a predictive model largely varies with respect to the rate of missingness in the given input, and show that it adversarially affects the model performance. We first theoretically analyze this phenomenon and propose a simple yet effective technique to handle missingness, which we refer to as Sparsity Normalization (SN), that directly targets and resolves the VSP. We further experimentally validate SN on diverse benchmark datasets, to show that debiasing the effect of input-level sparsity improves the performance and stabilizes the training of neural networks.
Keyword: Missing Data, Collaborative Filtering, Health Care, Tabular Data, High Dimensional Data, Deep Learning, Neural Networks
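
One way to read Sparsity Normalization is as a rescaling that keeps the magnitude of a zero-imputed input independent of how many entries happen to be missing, much like inverted dropout scaling. The sketch below implements that reading; the exact normalization constant in the paper may differ, so treat this as illustrative rather than the reference implementation.
```python
import torch

def sparsity_normalize(x, mask, eps=1e-8):
    """Zero-impute and rescale so the input scale does not depend on the
    per-example missing rate (illustrative variant of Sparsity Normalization).

    x:    (batch, d) raw features
    mask: (batch, d) 1.0 where observed, 0.0 where missing
    """
    d = x.size(1)
    observed = mask.sum(dim=1, keepdim=True)   # number of observed entries per row
    return x * mask * d / (observed + eps)     # zero-impute, then rescale

x = torch.randn(4, 10)
mask = (torch.rand(4, 10) > 0.3).float()       # hypothetical missingness pattern
print(sparsity_normalize(x, mask).shape)
```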

DropEdge: Towards Deep Graph Convolutional Networks on Node Classification
Author: Yu Rong, Wenbing Huang, Tingyang Xu, Junzhou Huang
link: https://openreview.net/pdf?id=Hkx1qkrKPr
Code: https://github.com/DropEdge/DropEdge
Abstract: Over-fitting and over-smoothing are two main obstacles of developing deep Graph Convolutional Networks (GCNs) for node classification. In particular, over-fitting weakens the generalization ability on small datasets, while over-smoothing impedes model training by isolating output representations from the input features with the increase in network depth. This paper proposes DropEdge, a novel and flexible technique to alleviate both issues. At its core, DropEdge randomly removes a certain number of edges from the input graph at each training epoch, acting like a data augmenter and also a message passing reducer. Furthermore, we theoretically demonstrate that DropEdge either reduces the convergence speed of over-smoothing or relieves the information loss caused by it. More importantly, our DropEdge is a general skill that can be equipped with many other backbone models (e.g. GCN, ResGCN, GraphSAGE, and JKNet) for enhanced performance. Extensive experiments on several benchmarks verify that DropEdge consistently improves the performance on a variety of both shallow and deep GCNs. The effect of DropEdge on preventing over-smoothing is empirically visualized and validated as well. Codes are released at the repository linked above.
Keyword: graph neural network, over-smoothing, over-fitting, dropedge, graph convolutional networks
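
The core operation is simple enough to write in a few lines: before each training epoch, drop a random subset of the input graph's edges and run message passing on what remains. A sketch over an edge-index representation (as used by common GNN libraries) follows; the drop rate and any re-symmetrization of the graph are placeholders.
```python
import torch

def drop_edge(edge_index, drop_rate=0.2):
    """Randomly remove a fraction of edges each training epoch.

    edge_index: (2, num_edges) tensor of source/target node indices.
    """
    num_edges = edge_index.size(1)
    keep = torch.rand(num_edges) >= drop_rate
    return edge_index[:, keep]

edge_index = torch.tensor([[0, 1, 2, 3, 0],
                           [1, 2, 3, 0, 2]])   # toy graph
print(drop_edge(edge_index, drop_rate=0.4))
```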

Mask Based Unsupervised Content Transfer
Author: Ron Mokady, Sagie Benaim, Lior Wolf, Amit Bermano
link: https://openreview.net/pdf?id=BJe-91BtvH
Code: https://github.com/rmokady/mbu-content-tansfer
Abstract: We consider the problem of translating, in an unsupervised manner, between two domains where one contains some additional information compared to the other. The proposed method disentangles the common and separate parts of these domains and, through the generation of a mask, focuses the attention of the underlying network to the desired augmentation alone, without wastefully reconstructing the entire target. This enables state-of-the-art quality and variety of content translation, as demonstrated through extensive quantitative and qualitative evaluation. Our method is also capable of adding the separate content of different guide images and domains, as well as removing existing separate content. Furthermore, our method enables weakly-supervised semantic segmentation of the separate part of each domain, where only class labels are provided. Our code is available at the repository linked above.
Keyword: None

U-GAT-IT: Unsupervised Generative Attentional Networks with Adaptive Layer-Instance Normalization for Image-to-Image Translation
Author: Junho Kim, Minjae Kim, Hyeonwoo Kang, Kwang Hee Lee
link: https://openreview.net/pdf?id=BJlZ5ySKPH
Code: https://github.com/taki0112/UGATIT
Abstract: We propose a novel method for unsupervised image-to-image translation, which incorporates a new attention module and a new learnable normalization function in an end-to-end manner. The attention module guides our model to focus on more important regions distinguishing between source and target domains based on the attention map obtained by the auxiliary classifier. Unlike previous attention-based method which cannot handle the geometric changes between domains, our model can translate both images requiring holistic changes and images requiring large shape changes. Moreover, our new AdaLIN (Adaptive Layer-Instance Normalization) function helps our attention-guided model to flexibly control the amount of change in shape and texture by learned parameters depending on datasets. Experimental results show the superiority of the proposed method compared to the existing state-of-the-art models with a fixed network architecture and hyper-parameters.
Keyword: Image-to-Image Translation, Generative Attentional Networks, Adaptive Layer-Instance Normalization
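
AdaLIN mixes instance-normalized and layer-normalized statistics with a learnable ratio rho, with the affine parameters gamma and beta predicted from the input rather than fixed. The PyTorch sketch below shows that blend; details such as the initialization and clipping of rho and how gamma/beta are produced should be checked against the official repository linked above.
```python
import torch
import torch.nn as nn

class AdaLIN(nn.Module):
    """Adaptive Layer-Instance Normalization (illustrative sketch).

    Mixes instance-norm and layer-norm statistics with a learnable rho;
    gamma and beta are supplied externally (e.g. predicted from style features).
    """
    def __init__(self, num_channels, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.rho = nn.Parameter(torch.full((1, num_channels, 1, 1), 0.9))

    def forward(self, x, gamma, beta):
        # Instance-norm statistics: per sample, per channel.
        in_mean = x.mean(dim=(2, 3), keepdim=True)
        in_var = x.var(dim=(2, 3), keepdim=True, unbiased=False)
        x_in = (x - in_mean) / torch.sqrt(in_var + self.eps)

        # Layer-norm statistics: per sample, over channels and space.
        ln_mean = x.mean(dim=(1, 2, 3), keepdim=True)
        ln_var = x.var(dim=(1, 2, 3), keepdim=True, unbiased=False)
        x_ln = (x - ln_mean) / torch.sqrt(ln_var + self.eps)

        rho = self.rho.clamp(0.0, 1.0)
        out = rho * x_in + (1.0 - rho) * x_ln
        return out * gamma.view(x.size(0), -1, 1, 1) + beta.view(x.size(0), -1, 1, 1)

x = torch.randn(2, 8, 16, 16)
gamma, beta = torch.ones(2, 8), torch.zeros(2, 8)   # normally predicted by small FC layers
print(AdaLIN(8)(x, gamma, beta).shape)
```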

Inductive and Unsupervised Representation Learning on Graph Structured Objects
Author: Lichen Wang, Bo Zong, Qianqian Ma, Wei Cheng, Jingchao Ni, Wenchao Yu, Yanchi Liu, Dongjin Song, Haifeng Chen, Yun Fu
link: https://openreview.net/pdf?id=rkem91rtDB
Code: None
Abstract: Inductive and unsupervised graph learning is a critical technique for predictive or information retrieval tasks where label information is difficult to obtain. It is also challenging to make graph learning inductive and unsupervised at the same time, as learning processes guided by reconstruction error based loss functions inevitably demand graph similarity evaluation that is usually computationally intractable. In this paper, we propose a general framework SEED (Sampling, Encoding, and Embedding Distributions) for inductive and unsupervised representation learning on graph structured objects. Instead of directly dealing with the computational challenges raised by graph similarity evaluation, given an input graph, the SEED framework samples a number of subgraphs whose reconstruction errors could be efficiently evaluated, encodes the subgraph samples into a collection of subgraph vectors, and employs the embedding of the subgraph vector distribution as the output vector representation for the input graph. By theoretical analysis, we demonstrate the close connection between SEED and graph isomorphism. Using public benchmark datasets, our empirical study suggests the proposed SEED framework is able to achieve up to 10% improvement, compared with competitive baseline methods.
Keyword: Graph representation learning, Graph isomorphism, Graph similarity learning

Batch-shaping for learning conditional channel gated networks
Author: Babak Ehteshami Bejnordi, Tijmen Blankevoort, Max Welling
link: https://openreview.net/pdf?id=Bke89JBtvB
Code: None
Abstract: We present a method that trains large capacity neural networks with significantly improved accuracy and lower dynamic computational cost. This is achieved by gating the deep-learning architecture on a fine-grained-level. Individual convolutional maps are turned on/off conditionally on features in the network. To achieve this, we introduce a new residual block architecture that gates convolutional channels in a fine-grained manner. We also introduce a generally applicable tool batch-shaping that matches the marginal aggregate posteriors of features in a neural network to a pre-specified prior distribution. We use this novel technique to force gates to be more conditional on the data. We present results on CIFAR-10 and ImageNet datasets for image classification, and Cityscapes for semantic segmentation. Our results show that our method can slim down large architectures conditionally, such that the average computational cost on the data is on par with a smaller architecture, but with higher accuracy. In particular, on ImageNet, our ResNet50 and ResNet34 gated networks obtain 74.60% and 72.55% top-1 accuracy compared to the 69.76% accuracy of the baseline ResNet18 model, for similar complexity. We also show that the resulting networks automatically learn to use more features for difficult examples and fewer features for simple examples.
Keyword: Conditional computation, channel gated networks, gating, Batch-shaping, distribution matching, image classification, semantic segmentation

Learning Robust Representations via Multi-View Information Bottleneck
Author: Marco Federici, Anjan Dutta, Patrick Forré, Nate Kushman, Zeynep Akata
link: https://openreview.net/pdf?id=B1xwcyHFDr
Code: https://github.com/mfederici/Multi-View-Information-Bottleneck
Abstract: The information bottleneck principle provides an information-theoretic method for representation learning, by training an encoder to retain all information which is relevant for predicting the label while minimizing the amount of other, excess information in the representation. The original formulation, however, requires labeled data to identify the superfluous information. In this work, we extend this ability to the multi-view unsupervised setting, where two views of the same underlying entity are provided but the label is unknown. This enables us to identify superfluous information as that not shared by both views. A theoretical analysis leads to the definition of a new multi-view model that produces state-of-the-art results on the Sketchy dataset and label-limited versions of the MIR-Flickr dataset. We also extend our theory to the single-view setting by taking advantage of standard data augmentation techniques, empirically showing better generalization capabilities when compared to common unsupervised approaches for representation learning.
Keyword: Information Bottleneck, Multi-View Learning, Representation Learning, Information Theory

Deep probabilistic subsampling for task-adaptive compressed sensing
Author: Iris A.M. Huijben, Bastiaan S. Veeling, Ruud J.G. van Sloun
link: https://openreview.net/pdf?id=SJeq9JBFvH
Code: None
Abstract: The field of deep learning is commonly concerned with optimizing predictive models using large pre-acquired datasets of densely sampled datapoints or signals. In this work, we demonstrate that the deep learning paradigm can be extended to incorporate a subsampling scheme that is jointly optimized under a desired minimum sample rate. We present Deep Probabilistic Subsampling (DPS), a widely applicable framework for task-adaptive compressed sensing that enables end-to-end optimization of an optimal subset of signal samples with a subsequent model that performs a required task. We demonstrate strong performance on reconstruction and classification tasks of a toy dataset, MNIST, and CIFAR10 under stringent subsampling rates in both the pixel and the spatial frequency domain. Due to the task-agnostic nature of the framework, DPS is directly applicable to all real-world domains that benefit from sample rate reduction.
Keyword: None

Robust anomaly detection and backdoor attack detection via differential privacy
Author: Min Du, Ruoxi Jia, Dawn Song
link: https://openreview.net/pdf?id=SJx0q1rtvS
Code: https://www.dropbox.com/sh/rt8qzii7wr07g6n/AAAbwokv2sfBeE9XAL2pXv_Aa?dl=0
Abstract: Outlier detection and novelty detection are two important topics for anomaly detection. Suppose the majority of a dataset are drawn from a certain distribution, outlier detection and novelty detection both aim to detect data samples that do not fit the distribution. Outliers refer to data samples within this dataset, while novelties refer to new samples. In the meantime, backdoor poisoning attacks for machine learning models are achieved through injecting poisoning samples into the training dataset, which could be regarded as “outliers” that are intentionally added by attackers. Differential privacy has been proposed to avoid leaking any individual’s information, when aggregated analysis is performed on a given dataset. It is typically achieved by adding random noise, either directly to the input dataset, or to intermediate results of the aggregation mechanism. In this paper, we demonstrate that applying differential privacy could improve the utility of outlier detection and novelty detection, with an extension to detect poisoning samples in backdoor attacks. We first present a theoretical analysis on how differential privacy helps with the detection, and then conduct extensive experiments to validate the effectiveness of differential privacy in improving outlier detection, novelty detection, and backdoor attack detection.
Keyword: outlier detection, novelty detection, backdoor attack detection, system log anomaly detection, differential privacy

Learning to Guide Random Search
Author: Ozan Sener, Vladlen Koltun
link: https://openreview.net/pdf?id=B1gHokBKwS
Code: https://github.com/intel-isl/LMRS
Abstract: We are interested in derivative-free optimization of high-dimensional functions. The sample complexity of existing methods is high and depends on problem dimensionality, unlike the dimensionality-independent rates of first-order methods. The recent success of deep learning suggests that many datasets lie on low-dimensional manifolds that can be represented by deep nonlinear models. We therefore consider derivative-free optimization of a high-dimensional function that lies on a latent low-dimensional manifold. We develop an online learning approach that learns this manifold while performing the optimization. In other words, we jointly learn the manifold and optimize the function. Our analysis suggests that the presented method significantly reduces sample complexity. We empirically evaluate the method on continuous optimization benchmarks and high-dimensional continuous control problems. Our method achieves significantly lower sample complexity than Augmented Random Search, Bayesian optimization, covariance matrix adaptation (CMA-ES), and other derivative-free optimization algorithms.
Keyword: Random search, Derivative-free optimization, Learning continuous control

Lagrangian Fluid Simulation with Continuous Convolutions
Author: Benjamin Ummenhofer, Lukas Prantl, Nils Thuerey, Vladlen Koltun
link: https://openreview.net/pdf?id=B1lDoJSYDH
Code: None
Abstract: We present an approach to Lagrangian fluid simulation with a new type of convolutional network. Our networks process sets of moving particles, which describe fluids in space and time. Unlike previous approaches, we do not build an explicit graph structure to connect the particles but use spatial convolutions as the main differentiable operation that relates particles to their neighbors. To this end we present a simple, novel, and effective extension of N-D convolutions to the continuous domain. We show that our network architecture can simulate different materials, generalizes to arbitrary collision geometries, and can be used for inverse problems. In addition, we demonstrate that our continuous convolutions outperform prior formulations in terms of accuracy and speed.
Keyword: particle-based physics, fluid mechanics, continuous convolutions, material estimation

Reinforced Genetic Algorithm Learning for Optimizing Computation Graphs
Author: Aditya Paliwal, Felix Gimeno, Vinod Nair, Yujia Li, Miles Lubin, Pushmeet Kohli, Oriol Vinyals
link: https://openreview.net/pdf?id=rkxDoJBYPB
Code: None
Abstract: We present a deep reinforcement learning approach to minimizing the execution cost of neural network computation graphs in an optimizing compiler. Unlike earlier learning-based works that require training the optimizer on the same graph to be optimized, we propose a learning approach that trains an optimizer offline and then generalizes to previously unseen graphs without further training. This allows our approach to produce high-quality execution decisions on real-world TensorFlow graphs in seconds instead of hours. We consider two optimization tasks for computation graphs: minimizing running time and peak memory usage. In comparison to an extensive set of baselines, our approach achieves significant improvements over classical and other learning-based methods on these two tasks.
Keyword: reinforcement learning, learning to optimize, combinatorial optimization, computation graphs, model parallelism, learning for systems

Compressive Transformers for Long-Range Sequence Modelling
Author: Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Chloe Hillier, Timothy P. Lillicrap
link: https://openreview.net/pdf?id=SylKikSYDH
Code: None
Abstract: We present the Compressive Transformer, an attentive sequence model which compresses past memories for long-range sequence learning. We find the Compressive Transformer obtains state-of-the-art language modelling results in the WikiText-103 and Enwik8 benchmarks, achieving 17.1 ppl and 0.97bpc respectively. We also find it can model high-frequency speech effectively and can be used as a memory mechanism for RL, demonstrated on an object matching task. To promote the domain of long-range sequence learning, we propose a new open-vocabulary language modelling benchmark derived from books, PG-19.
Keyword: memory, language modeling, transformer, compression

A Stochastic Derivative Free Optimization Method with Momentum
Author: Eduard Gorbunov, Adel Bibi, Ozan Sener, El Houcine Bergou, Peter Richtarik
link: https://openreview.net/pdf?id=HylAoJSKvH
Code: None
Abstract: We consider the problem of unconstrained minimization of a smooth objective function in $\mathbb{R}^d$ in a setting where only function evaluations are possible. We propose and analyze a stochastic zeroth-order method with heavy ball momentum. In particular, we propose SMTP, a momentum version of the stochastic three-point method (STP; Bergou et al., 2019). We show new complexity results for non-convex, convex and strongly convex functions. We test our method on a collection of continuous control tasks on several MuJoCo (Todorov et al., 2012) environments with varying difficulty and compare against STP, other state-of-the-art derivative-free optimization algorithms and against policy gradient methods. SMTP significantly outperforms STP and all other methods that we considered in our numerical experiments. Our second contribution is SMTP with importance sampling, which we call SMTP_IS. We provide convergence analysis of this method for non-convex, convex and strongly convex objectives.
Keyword: derivative-free optimization, stochastic optimization, heavy ball momentum, importance sampling
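
The underlying stochastic three-point idea needs only function values: probe a random direction in both signs and keep whichever candidate is best. The sketch below adds a simple heavy-ball-style momentum term on top of that probing scheme; it conveys the flavour of SMTP, but the paper's exact momentum update and step-size rules are not reproduced here.
```python
import numpy as np

def smtp_like(f, x0, steps=500, alpha=0.1, beta=0.5, seed=0):
    """Derivative-free minimization: stochastic three-point probing plus a
    heavy-ball-style momentum term (illustrative, not the paper's exact update)."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    v = np.zeros_like(x)                       # momentum buffer
    for _ in range(steps):
        s = rng.standard_normal(x.shape)
        s /= np.linalg.norm(s)                 # random unit direction
        candidates = [x + beta * v,            # momentum-only move
                      x + beta * v + alpha * s,
                      x + beta * v - alpha * s]
        best = min(candidates, key=f)          # pick the best probe by function value
        if f(best) <= f(x):                    # accept only if it does not increase f
            v = best - x
            x = best
    return x

print(smtp_like(lambda z: np.sum(z ** 2), x0=np.ones(10)))
```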

Understanding and Improving Information Transfer in Multi-Task Learning
Author: Sen Wu, Hongyang Zhang, Christopher Ré
link: https://openreview.net/pdf?id=SylzhkBtDB
Code: None
Abstract: We investigate multi-task learning approaches which use a shared feature representation for all tasks. To better understand the transfer of task information, we study an architecture with a shared module for all tasks and a separate output module for each task. We study the theory of this setting on linear and ReLU-activated models. Our key observation is that whether or not tasks’ data are well-aligned can significantly affect the performance of multi-task learning. We show that misalignment between task data can cause negative transfer (or hurt performance) and provide sufficient conditions for positive transfer. Inspired by the theoretical insights, we show that aligning tasks’ embedding layers leads to performance gains for multi-task training and transfer learning on the GLUE benchmark and sentiment analysis tasks; for example, we obtained a 2.35% GLUE score average improvement on 5 GLUE tasks over BERT LARGE using our alignment method. We also design an SVD-based task re-weighting scheme and show that it improves the robustness of multi-task training on a multi-label image dataset.
Keyword: Multi-Task Learning

Learning To Explore Using Active Neural SLAM
Author: Devendra Singh Chaplot, Dhiraj Gandhi, Saurabh Gupta, Abhinav Gupta, Ruslan Salakhutdinov
link: https://openreview.net/pdf?id=HklXn1BKDH
Code: https://github.com/devendrachaplot/Neural-SLAM
Abstract: This work presents a modular and hierarchical approach to learn policies for exploring 3D environments, called 'Active Neural SLAM'. Our approach leverages the strengths of both classical and learning-based methods, by using analytical path planners with a learned SLAM module, and global and local policies. The use of learning provides flexibility with respect to input modalities (in the SLAM module), leverages structural regularities of the world (in global policies), and provides robustness to errors in state estimation (in local policies). Such use of learning within each module retains its benefits, while at the same time, hierarchical decomposition and modular training allow us to sidestep the high sample complexities associated with training end-to-end policies. Our experiments in visually and physically realistic simulated 3D environments demonstrate the effectiveness of our approach over past learning and geometry-based approaches. The proposed model can also be easily transferred to the PointGoal task and was the winning entry of the CVPR 2019 Habitat PointGoal Navigation Challenge.
Keyword: Navigation, Exploration

EMPIR: Ensembles of Mixed Precision Deep Networks for Increased Robustness Against Adversarial Attacks
Author: Sanchari Sen, Balaraman Ravindran, Anand Raghunathan
link: https://openreview.net/pdf?id=HJem3yHKwH
Code: https://github.com/sancharisen/EMPIR
Abstract: Ensuring robustness of Deep Neural Networks (DNNs) is crucial to their adoption in safety-critical applications such as self-driving cars, drones, and healthcare. Notably, DNNs are vulnerable to adversarial attacks in which small input perturbations can produce catastrophic misclassifications. In this work, we propose EMPIR, ensembles of quantized DNN models with different numerical precisions, as a new approach to increase robustness against adversarial attacks. EMPIR is based on the observation that quantized neural networks often demonstrate much higher robustness to adversarial attacks than full precision networks, but at the cost of a substantial loss in accuracy on the original (unperturbed) inputs. EMPIR overcomes this limitation to achieve the "best of both worlds", i.e., the higher unperturbed accuracies of the full precision models combined with the higher robustness of the low precision models, by composing them in an ensemble. Further, as low precision DNN models have significantly lower computational and storage requirements than full precision models, EMPIR models only incur modest compute and memory overheads compared to a single full-precision model (<25% in our evaluations). We evaluate EMPIR across a suite of DNNs for 3 different image recognition tasks (MNIST, CIFAR-10 and ImageNet) and under 4 different adversarial attacks. Our results indicate that EMPIR boosts the average adversarial accuracies by 42.6%, 15.2% and 10.5% for the DNN models trained on the MNIST, CIFAR-10 and ImageNet datasets respectively, when compared to single full-precision models, without sacrificing accuracy on the unperturbed inputs.
Keyword: ensembles, mixed precision, robustness, adversarial attacks

Quantifying Point-Prediction Uncertainty in Neural Networks via Residual Estimation with an I/O Kernel
Author: Xin Qiu, Elliot Meyerson, Risto Miikkulainen
link: https://openreview.net/pdf?id=rkxNh1Stvr
Code: https://github.com/leaf-ai/rio-paper
Abstract: Neural Networks (NNs) have been extensively used for a wide spectrum of real-world regression tasks, where the goal is to predict a numerical outcome such as revenue, effectiveness, or a quantitative result. In many such tasks, the point prediction is not enough: the uncertainty (i.e. risk or confidence) of that prediction must also be estimated. Standard NNs, which are most often used in such tasks, do not provide uncertainty information. Existing approaches address this issue by combining Bayesian models with NNs, but these models are hard to implement, more expensive to train, and usually do not predict as accurately as standard NNs. In this paper, a new framework (RIO) is developed that makes it possible to estimate uncertainty in any pretrained standard NN. The behavior of the NN is captured by modeling its prediction residuals with a Gaussian Process, whose kernel includes both the NN’s input and its output. The framework is justified theoretically and evaluated in twelve real-world datasets, where it is found to (1) provide reliable estimates of uncertainty, (2) reduce the error of the point predictions, and (3) scale well to large datasets. Given that RIO can be applied to any standard NN without modifications to model architecture or training pipeline, it provides an important ingredient for building real-world NN applications.
Keyword: Uncertainty Estimation, Neural Networks, Gaussian Process
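
RIO models the residuals of a pretrained network with a Gaussian Process whose kernel sees both the network's input and its output. The NumPy sketch below fits such a residual GP with a sum of two RBF kernels (one on the inputs, one on the predictions) and returns corrected predictions plus predictive variances; length-scales, noise level, and the exact kernel composition are illustrative assumptions rather than the paper's settings.
```python
import numpy as np

def rbf(A, B, lengthscale=1.0):
    """Squared-exponential kernel between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def rio_correct(X_tr, yhat_tr, y_tr, X_te, yhat_te, noise=1e-2):
    """Fit a GP to the NN's residuals with an input+output ('I/O') sum kernel,
    then return corrected predictions and predictive variances on test points."""
    r = y_tr - yhat_tr                                           # residuals the GP models
    K = rbf(X_tr, X_tr) + rbf(yhat_tr[:, None], yhat_tr[:, None])
    Ks = rbf(X_te, X_tr) + rbf(yhat_te[:, None], yhat_tr[:, None])
    Kss = rbf(X_te, X_te) + rbf(yhat_te[:, None], yhat_te[:, None])
    L = np.linalg.cholesky(K + noise * np.eye(len(X_tr)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, r))
    mean = Ks @ alpha                                            # GP posterior mean of the residual
    v = np.linalg.solve(L, Ks.T)
    var = np.diag(Kss) - (v ** 2).sum(0)
    return yhat_te + mean, var                                   # corrected prediction + uncertainty

# Toy usage with a hypothetical "pretrained" predictor yhat(x) = 0.9 * sum(x):
X_tr = np.random.randn(50, 3)
y_tr = X_tr.sum(1) + 0.1 * np.random.randn(50)
yhat_tr = 0.9 * X_tr.sum(1)
X_te = np.random.randn(5, 3)
yhat_te = 0.9 * X_te.sum(1)
print(rio_correct(X_tr, yhat_tr, y_tr, X_te, yhat_te))
```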

B-Spline CNNs on Lie groups
Author: Erik J Bekkers
link: https://openreview.net/pdf?id=H1gBhkBFDH
Code: https://github.com/ebekkers/gsplinets
Abstract: Group convolutional neural networks (G-CNNs) can be used to improve classical CNNs by equipping them with the geometric structure of groups. Central in the success of G-CNNs is the lifting of feature maps to higher dimensional disentangled representations, in which data characteristics are effectively learned, geometric data-augmentations are made obsolete, and predictable behavior under geometric transformations (equivariance) is guaranteed via group theory. Currently, however, the practical implementations of G-CNNs are limited to either discrete groups (that leave the grid intact) or continuous compact groups such as rotations (that enable the use of Fourier theory). In this paper we lift these limitations and propose a modular framework for the design and implementation of G-CNNs for arbitrary Lie groups. In our approach the differential structure of Lie groups is used to expand convolution kernels in a generic basis of B-splines that is defined on the Lie algebra. This leads to a flexible framework that enables localized, atrous, and deformable convolutions in G-CNNs by means of respectively localized, sparse and non-uniform B-spline expansions. The impact and potential of our approach is studied on two benchmark datasets: cancer detection in histopathology slides (PCam dataset) in which rotation equivariance plays a key role and facial landmark localization (CelebA dataset) in which scale equivariance is important. In both cases, G-CNN architectures outperform their classical 2D counterparts and the added value of atrous and localized group convolutions is studied in detail.
Keyword: equivariance, Lie groups, B-Splines, G-CNNs, deep learning, group convolution, computer vision, medical image analysis

Neural Outlier Rejection for Self-Supervised Keypoint Learning
Author: Jiexiong Tang, Hanme Kim, Vitor Guizilini, Sudeep Pillai, Rares Ambrus
link: https://openreview.net/pdf?id=Skx82ySYPH
Code: https://github.com/TRI-ML/KP2D
Abstract: Identifying salient points in images is a crucial component for visual odometry, Structure-from-Motion or SLAM algorithms. Recently, several learned keypoint methods have demonstrated compelling performance on challenging benchmarks. However, generating consistent and accurate training data for interest-point detection in natural images still remains challenging, especially for human annotators. We introduce IO-Net (i.e. InlierOutlierNet), a novel proxy task for the self-supervision of keypoint detection, description and matching. By making the sampling of inlier-outlier sets from point-pair correspondences fully differentiable within the keypoint learning framework, we show that we are able to simultaneously self-supervise keypoint description and improve keypoint matching. Second, we introduce KeyPointNet, a keypoint-network architecture that is especially amenable to robust keypoint detection and description. We design the network to allow local keypoint aggregation to avoid artifacts due to spatial discretizations commonly used for this task, and we improve fine-grained keypoint descriptor performance by taking advantage of efficient sub-pixel convolutions to upsample the descriptor feature-maps to a higher operating resolution. Through extensive experiments and ablative analysis, we show that the proposed self-supervised keypoint learning method greatly improves the quality of feature matching and homography estimation on challenging benchmarks over the state-of-the-art.
Keyword: Self-Supervised Learning, Keypoint Detection, Outlier Rejection, Deep Learning

Reducing Transformer Depth on Demand with Structured Dropout
Author: Angela Fan, Edouard Grave, Armand Joulin
link: https://openreview.net/pdf?id=SylO2yStDr
Code: None
Abstract: Overparametrized transformer networks have obtained state of the art results in various natural language processing tasks, such as machine translation, language modeling, and question answering. These models contain hundreds of millions of parameters, necessitating a large amount of computation and making them prone to overfitting. In this work, we explore LayerDrop, a form of structured dropout, which has a regularization effect during training and allows for efficient pruning at inference time. In particular, we show that it is possible to select sub-networks of any depth from one large network without having to finetune them and with limited impact on performance. We demonstrate the effectiveness of our approach by improving the state of the art on machine translation, language modeling, summarization, question answering, and language understanding benchmarks. Moreover, we show that our approach leads to small BERT-like models of higher quality than when training from scratch or using distillation.
Keyword: reduction, regularization, pruning, dropout, transformer
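
LayerDrop boils down to skipping entire residual blocks with some probability during training, so that at inference time shallower sub-networks can be extracted without finetuning. A rough sketch over generic residual blocks is below; the per-layer drop rate and the every-other-layer pruning pattern are placeholders.
```python
import torch
import torch.nn as nn

class LayerDropStack(nn.Module):
    """Stack of residual blocks; each block is skipped with probability p_drop during training."""
    def __init__(self, layers, p_drop=0.2):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        self.p_drop = p_drop

    def forward(self, x, keep_every=None):
        for i, layer in enumerate(self.layers):
            if self.training and torch.rand(1).item() < self.p_drop:
                continue                              # structured dropout: skip the whole layer
            if keep_every is not None and i % keep_every != 0:
                continue                              # on-demand depth reduction at inference
            x = x + layer(x)                          # residual connection keeps shapes valid
        return x

blocks = [nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64)) for _ in range(12)]
model = LayerDropStack(blocks, p_drop=0.2)
x = torch.randn(8, 64)
model.eval()
print(model(x, keep_every=2).shape)                   # use only every other layer, no finetuning
```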

Cross-Lingual Ability of Multilingual BERT: An Empirical Study
Author: Karthikeyan K, Zihan Wang, Stephen Mayhew, Dan Roth
link: https://openreview.net/pdf?id=HJeT3yrtDr
Code: None
Abstract: Recent work has exhibited the surprising cross-lingual abilities of multilingual BERT (M-BERT) – surprising since it is trained without any cross-lingual objective and with no aligned data. In this work, we provide a comprehensive study of the contribution of different components in M-BERT to its cross-lingual ability. We study the impact of linguistic properties of the languages, the architecture of the model, and the learning objectives. The experimental study is done in the context of three typologically different languages – Spanish, Hindi, and Russian – and using two conceptually different NLP tasks, textual entailment and named entity recognition. Among our key conclusions is the fact that the lexical overlap between languages plays a negligible role in the cross-lingual success, while the depth of the network is an integral part of it. All our models and implementations can be found on our project page.
Keyword: Cross-Lingual Learning, Multilingual BERT

SPACE: Unsupervised Object-Oriented Scene Representation via Spatial Attention and Decomposition
Author: Zhixuan Lin, Yi-Fu Wu, Skand Vishwanath Peri, Weihao Sun, Gautam Singh, Fei Deng, Jindong Jiang, Sungjin Ahn
link: https://openreview.net/pdf?id=rkl03ySYDH
Code: None
Abstract: The ability to decompose complex multi-object scenes into meaningful abstractions like objects is fundamental to achieve higher-level cognition. Previous approaches for unsupervised object-oriented scene representation learning are either based on spatial-attention or scene-mixture approaches and limited in scalability which is a main obstacle towards modeling real-world scenes. In this paper, we propose a generative latent variable model, called SPACE, that provides a unified probabilistic modeling framework that combines the best of spatial-attention and scene-mixture approaches. SPACE can explicitly provide factorized object representations for foreground objects while also decomposing background segments of complex morphology. Previous models are good at either of these, but not both. SPACE also resolves the scalability problems of previous methods by incorporating parallel spatial-attention and thus is applicable to scenes with a large number of objects without performance degradations. We show through experiments on Atari and 3D-Rooms that SPACE achieves the above properties consistently in comparison to SPAIR, IODINE, and GENESIS. Results of our experiments can be found on our project website.
Keyword: Generative models, Unsupervised scene representation, Object-oriented representation, spatial attention

RIDE: Rewarding Impact-Driven Exploration for Procedurally-Generated Environments
Author: Roberta Raileanu, Tim Rocktäschel
link: https://openreview.net/pdf?id=rkg-TJBFPB
Code: https://github.com/facebookresearch/impact-driven-exploration 
Abstract: Exploration in sparse reward environments remains one of the key challenges of model-free reinforcement learning. Instead of solely relying on extrinsic rewards provided by the environment, many state-of-the-art methods use intrinsic rewards to encourage exploration. However, we show that existing methods fall short in procedurally-generated environments where an agent is unlikely to visit a state more than once. We propose a novel type of intrinsic reward which encourages the agent to take actions that lead to significant changes in its learned state representation. We evaluate our method on multiple challenging procedurally-generated tasks in MiniGrid, as well as on tasks with high-dimensional observations used in prior work. Our experiments demonstrate that this approach is more sample efficient than existing exploration methods, particularly for procedurally-generated MiniGrid environments. Furthermore, we analyze the learned behavior as well as the intrinsic reward received by our agent. In contrast to previous approaches, our intrinsic reward does not diminish during the course of training and it rewards the agent substantially more for interacting with objects that it can control.
Keyword: reinforcement learning, exploration, curiosity
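
The intrinsic reward is, roughly, the distance between the learned embeddings of consecutive observations, discounted by how often the new state has already been visited in the episode. The sketch below computes that bonus; the embedding network here is a placeholder, and the forward/inverse-dynamics losses RIDE uses to train it are omitted.
```python
import torch
import torch.nn as nn

embed = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 32))  # learned state encoder

def impact_bonus(obs, next_obs, episodic_count=1):
    """Intrinsic reward proportional to how much the learned representation changes,
    scaled down by how often the new state was already visited in this episode."""
    with torch.no_grad():
        delta = embed(next_obs) - embed(obs)
    return delta.norm(p=2, dim=-1) / (episodic_count ** 0.5)

obs, next_obs = torch.randn(4, 128), torch.randn(4, 128)   # placeholder observations
print(impact_bonus(obs, next_obs))
```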

Low-dimensional statistical manifold embedding of directed graphs
Author: Thorben Funke, Tian Guo, Alen Lancic, Nino Antulov-Fantulin
link: https://openreview.net/pdf?id=SkxQp1StDH
Code: None
Abstract: We propose a novel node embedding of directed graphs to statistical manifolds, which is based on a global minimization of pairwise relative entropy and graph geodesics in a non-linear way. Each node is encoded with a probability density function over a measurable space. Furthermore, we analyze the connection of the geometrical properties of such embedding and their efficient learning procedure. Extensive experiments show that our proposed embedding is better preserving the global geodesic information of graphs, as well as outperforming existing embedding models on directed graphs in a variety of evaluation metrics, in an unsupervised setting.
Keyword: graph embedding, information geometry, graph representations

Efficient Probabilistic Logic Reasoning with Graph Neural Networks
Author: Yuyu Zhang, Xinshi Chen, Yuan Yang, Arun Ramamurthy, Bo Li, Yuan Qi, Le Song
link: https://openreview.net/pdf?id=rJg76kStwH
Code: https://github.com/expressGNN/ExpressGNN
Abstract: Markov Logic Networks (MLNs), which elegantly combine logic rules and probabilistic graphical models, can be used to address many knowledge graph problems. However, inference in MLN is computationally intensive, making the industrial-scale application of MLN very difficult. In recent years, graph neural networks (GNNs) have emerged as efficient and effective tools for large-scale graph problems. Nevertheless, GNNs do not explicitly incorporate prior logic rules into the models, and may require many labeled examples for a target task. In this paper, we explore the combination of MLNs and GNNs, and use graph neural networks for variational inference in MLN. We propose a GNN variant, named ExpressGNN, which strikes a nice balance between the representation power and the simplicity of the model. Our extensive experiments on several benchmark datasets demonstrate that ExpressGNN leads to effective and efficient probabilistic logic reasoning.
Keyword: probabilistic logic reasoning, Markov Logic Networks, graph neural networks

GraphSAINT: Graph Sampling Based Inductive Learning Method
Author: Hanqing Zeng, Hongkuan Zhou, Ajitesh Srivastava, Rajgopal Kannan, Viktor Prasanna
link: https://openreview.net/pdf?id=BJe8pkHFwS
Code: https://github.com/GraphSAINT/GraphSAINT
Abstract: Graph Convolutional Networks (GCNs) are powerful models for learning representations of attributed graphs. To scale GCNs to large graphs, state-of-the-art methods use various layer sampling techniques to alleviate the “neighbor explosion” problem during minibatch training. We propose GraphSAINT, a graph sampling based inductive learning method that improves training efficiency and accuracy in a fundamentally different way. By changing perspective, GraphSAINT constructs minibatches by sampling the training graph, rather than the nodes or edges across GCN layers. In each iteration, a complete GCN is built from the properly sampled subgraph. Thus, we ensure a fixed number of well-connected nodes in all layers. We further propose a normalization technique to eliminate bias, and sampling algorithms for variance reduction. Importantly, we can decouple the sampling from the forward and backward propagation, and extend GraphSAINT with many architecture variants (e.g., graph attention, jumping connection). GraphSAINT demonstrates superior performance in both accuracy and training time on five large graphs, and achieves new state-of-the-art F1 scores for PPI (0.995) and Reddit (0.970).
Keyword: Graph Convolutional Networks, Graph sampling, Network embedding
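
The central idea is to build each minibatch by sampling a subgraph of the training graph and running a full GCN on it. The sketch below shows the simplest node-sampler variant of that idea; GraphSAINT's edge and random-walk samplers and its bias/variance normalization coefficients are omitted.
```python
import torch

def sample_subgraph(edge_index, num_nodes, sample_size):
    """Sample a node set and return the induced subgraph (node-sampler variant).
    One minibatch = one subgraph; a complete GCN is then run on it."""
    nodes = torch.randperm(num_nodes)[:sample_size]
    node_mask = torch.zeros(num_nodes, dtype=torch.bool)
    node_mask[nodes] = True
    edge_mask = node_mask[edge_index[0]] & node_mask[edge_index[1]]
    sub_edges = edge_index[:, edge_mask]
    # Relabel node ids to 0..sample_size-1 for the subgraph.
    relabel = torch.full((num_nodes,), -1, dtype=torch.long)
    relabel[nodes] = torch.arange(sample_size)
    return nodes, relabel[sub_edges]

edge_index = torch.tensor([[0, 1, 2, 3, 4, 0],
                           [1, 2, 3, 4, 0, 2]])        # toy graph with 5 nodes
nodes, sub_edge_index = sample_subgraph(edge_index, num_nodes=5, sample_size=3)
print(nodes, sub_edge_index)
```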

You Only Train Once: Loss-Conditional Training of Deep Networks
Author: Alexey Dosovitskiy, Josip Djolonga
link: https://openreview.net/pdf?id=HyxY6JHKwr
Code: None
Abstract: In many machine learning problems, loss functions are weighted sums of several terms. A typical approach to dealing with these is to train multiple separate models with different selections of weights and then either choose the best one according to some criterion or keep multiple models if it is desirable to maintain a diverse set of solutions. This is inefficient both at training and at inference time. We propose a method that allows replacing multiple models trained on one loss function each by a single model trained on a distribution of losses. At test time a model trained this way can be conditioned to generate outputs corresponding to any loss from the training distribution of losses. We demonstrate this approach on three tasks with parametrized losses: beta-VAE, learned image compression, and fast style transfer.
Keyword: deep learning, image generation
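
Loss-conditional training samples the loss weights at every step, feeds them to the network as a conditioning input, and optimizes the correspondingly weighted loss, so one model covers a whole family of trade-offs. The sketch below conditions simply by concatenating the sampled weight to the input; the paper uses a more elaborate conditioning mechanism, and the two loss terms here are placeholders.
```python
import torch
import torch.nn as nn

class ConditionedModel(nn.Module):
    """Model that receives the loss weight lambda as an extra input (simplified conditioning)."""
    def __init__(self, in_dim=32, out_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim + 1, 64), nn.ReLU(), nn.Linear(64, out_dim))

    def forward(self, x, lam):
        lam_feat = lam.expand(x.size(0), 1)            # broadcast lambda to the batch
        return self.net(torch.cat([x, lam_feat], dim=1))

model = ConditionedModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def loss_a(out, x): return ((out - x) ** 2).mean()     # e.g. a reconstruction term
def loss_b(out, x): return out.abs().mean()            # e.g. a rate/regularization term

for _ in range(10):
    x = torch.randn(16, 32)
    lam = torch.rand(1)                                # sample a trade-off at every step
    out = model(x, lam)
    loss = loss_a(out, x) + float(lam) * loss_b(out, x)  # loss weighted by the sampled lambda
    opt.zero_grad()
    loss.backward()
    opt.step()
# At test time, pass any lambda from the training range to get that trade-off from one model.
```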

Projection-Based Constrained Policy Optimization
Author: Tsung-Yen Yang, Justinian Rosca, Karthik Narasimhan, Peter J. Ramadge
link: https://openreview.net/pdf?id=rke3TJrtPS
Code: https://sites.google.com/view/iclr2020-pcpo
Abstract: We consider the problem of learning control policies that optimize a reward function while satisfying constraints due to considerations of safety, fairness, or other costs. We propose a new algorithm - Projection-Based Constrained Policy Optimization (PCPO), an iterative method for optimizing policies in a two-step process - the first step performs an unconstrained update while the second step reconciles the constraint violation by projecting the policy back onto the constraint set. We theoretically analyze PCPO and provide a lower bound on reward improvement, as well as an upper bound on constraint violation for each policy update. We further characterize the convergence of PCPO with projection based on two different metrics - L2 norm and Kullback-Leibler divergence. Our empirical results over several control tasks demonstrate that our algorithm achieves superior performance, averaging more than 3.5 times less constraint violation and around 15% higher reward compared to state-of-the-art methods.
Keyword: Reinforcement learning with constraints, Safe reinforcement learning

Infinite-Horizon Differentiable Model Predictive Control
Author: Sebastian East, Marco Gallieri, Jonathan Masci, Jan Koutnik, Mark Cannon
link: https://openreview.net/pdf?id=ryxC6kSYPr
Code: None
Abstract: This paper proposes a differentiable linear quadratic Model Predictive Control (MPC) framework for safe imitation learning. The infinite-horizon cost is enforced using a terminal cost function obtained from the discrete-time algebraic Riccati equation (DARE), so that the learned controller can be proven to be stabilizing in closed-loop. A central contribution is the derivation of the analytical derivative of the solution of the DARE, thereby allowing the use of differentiation-based learning methods. A further contribution is the structure of the MPC optimization problem: an augmented Lagrangian method ensures that the MPC optimization is feasible throughout training whilst enforcing hard constraints on state and input, and a pre-stabilizing controller ensures that the MPC solution and derivatives are accurate at each iteration. The learning capabilities of the framework are demonstrated in a set of numerical studies.
Keyword: Model Predictive Control, Riccati Equation, Imitation Learning, Safe Learning
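
The terminal cost that makes a finite-horizon MPC behave like an infinite-horizon controller comes from the discrete-time algebraic Riccati equation (DARE). The SciPy sketch below computes that terminal cost matrix and the associated LQR gain for a toy double integrator; the paper's analytical derivative of the DARE solution and the augmented Lagrangian MPC machinery are not reproduced here.
```python
import numpy as np
from scipy.linalg import solve_discrete_are

# Toy double-integrator dynamics x_{t+1} = A x_t + B u_t with quadratic stage cost.
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = np.array([[0.0],
              [0.1]])
Q = np.eye(2)            # state cost
R = np.array([[0.1]])    # input cost

P = solve_discrete_are(A, B, Q, R)                  # terminal cost matrix: V_f(x) = x' P x
K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # infinite-horizon LQR gain, u = -K x

print("terminal cost P:\n", P)
print("closed-loop eigenvalues:", np.linalg.eigvals(A - B @ K))  # inside unit circle, i.e. stabilizing
```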

Combining Q-Learning and Search with Amortized Value Estimates
Author: Jessica B. Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Tobias Pfaff, Theophane Weber, Lars Buesing, Peter W. Battaglia
link: https://openreview.net/pdf?id=SkeAaJrKDS
Code: None
Abstract: We introduce “Search with Amortized Value Estimates” (SAVE), an approach for combining model-free Q-learning with model-based Monte-Carlo Tree Search (MCTS). In SAVE, a learned prior over state-action values is used to guide MCTS, which estimates an improved set of state-action values. The new Q-estimates are then used in combination with real experience to update the prior. This effectively amortizes the value computation performed by MCTS, resulting in a cooperative relationship between model-free learning and model-based search. SAVE can be implemented on top of any Q-learning agent with access to a model, which we demonstrate by incorporating it into agents that perform challenging physical reasoning tasks and Atari. SAVE consistently achieves higher rewards with fewer training steps, and—in contrast to typical model-based search approaches—yields strong performance with very small search budgets. By combining real experience with information computed during search, SAVE demonstrates that it is possible to improve on both the performance of model-free learning and the computational cost of planning.
Keyword: model-based RL, Q-learning, MCTS, search

Training Generative Adversarial Networks from Incomplete Observations using Factorised Discriminators
Author: Daniel Stoller, Sebastian Ewert, Simon Dixon
link: https://openreview.net/pdf?id=Hye1RJHKwB
Code: https://www.dropbox.com/s/gtc7m7pc4n2yt05/source.zip?dl=1
Abstract: Generative adversarial networks (GANs) have shown great success in applications such as image generation and inpainting.
However, they typically require large datasets, which are often not available, especially in the context of prediction tasks such as image segmentation that require labels. Therefore, methods such as the CycleGAN use more easily available unlabelled data, but do not offer a way to leverage additional labelled data for improved performance. To address this shortcoming, we show how to factorise the joint data distribution into a set of lower-dimensional distributions along with their dependencies. This allows splitting the discriminator in a GAN into multiple “sub-discriminators” that can be independently trained from incomplete observations. Their outputs can be combined to estimate the density ratio between the joint real and the generator distribution, which enables training generators as in the original GAN framework. We apply our method to image generation, image segmentation and audio source separation, and obtain improved performance over a standard GAN when additional incomplete training examples are available. For the Cityscapes segmentation task in particular, our method also improves accuracy by an absolute 14.9% over CycleGAN while using only 25 additional paired examples.
Keyword: Adversarial Learning, Semi-supervised Learning, Image generation, Image segmentation, Missing Data

Decentralized Deep Learning with Arbitrary Communication Compression
Author: Anastasia Koloskova*, Tao Lin*, Sebastian U Stich, Martin Jaggi
link: https://openreview.net/pdf?id=SkgGCkrKvH
Code: https://github.com/epfml/ChocoSGD
Abstract: Decentralized training of deep learning models is a key element for enabling data privacy and on-device learning over networks, as well as for efficient scaling to large compute clusters. As current approaches are limited by network bandwidth, we propose the use of communication compression in the decentralized training context. We show that Choco-SGD achieves linear speedup in the number of workers for arbitrarily high compression ratios on general non-convex functions and non-IID training data. We demonstrate the practical performance of the algorithm in two key scenarios: the training of deep learning models (i) over decentralized user devices, connected by a peer-to-peer network and (ii) in a datacenter.
Keyword: None
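For readers unfamiliar with Choco-SGD, the following single-round sketch is a rough illustration of compressed decentralized gossip in the spirit of the abstract, not the repository's implementation: each worker takes a local SGD step, publishes a compressed correction to its public parameter copy, and then averages public copies with its neighbours. The mixing matrix `W` (rows summing to one), the `compress` operator and the step sizes are assumed inputs.

```python
# Hedged sketch of one round of decentralized SGD with compressed gossip.
# Not the authors' code; see https://github.com/epfml/ChocoSGD for the real thing.
import numpy as np

def choco_style_round(x, x_hat, grads, W, compress, lr=0.1, gamma=0.5):
    """x, x_hat: (n_workers, dim) private and public parameter copies."""
    x = x - lr * grads                                    # local SGD step per worker
    q = np.stack([compress(xi - xhi) for xi, xhi in zip(x, x_hat)])
    x_hat = x_hat + q                                     # update public copies with compressed diffs
    x = x + gamma * (W @ x_hat - x_hat)                   # gossip averaging over neighbours
    return x, x_hat
```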

Toward Evaluating Robustness of Deep Reinforcement Learning with Continuous Control
Author: Tsui-Wei Weng, Krishnamurthy (Dj) Dvijotham*, Jonathan Uesato*, Kai Xiao*, Sven Gowal*, Robert Stanforth*, Pushmeet Kohli
link: https://openreview.net/pdf?id=SylL0krYPS
Code: None
Abstract: Deep reinforcement learning has achieved great success in many previously difficult reinforcement learning tasks, yet recent studies show that deep RL agents are also unavoidably susceptible to adversarial perturbations, similar to deep neural networks in classification tasks. Prior works mostly focus on model-free adversarial attacks and agents with discrete actions. In this work, we study the problem of continuous control agents in deep RL with adversarial attacks and propose the first two-step algorithm based on learned model dynamics. Extensive experiments on various MuJoCo domains (Cartpole, Fish, Walker, Humanoid) demonstrate that our proposed framework is much more effective and efficient than model-free based attacks baselines in degrading agent performance as well as driving agents to unsafe states.
Keyword: deep learning, reinforcement learning, robustness, adversarial examples

Gradient ℓ1 Regularization for Quantization Robustness
Author: Milad Alizadeh, Arash Behboodi, Mart van Baalen, Christos Louizos, Tijmen Blankevoort, Max Welling
link: https://openreview.net/pdf?id=ryxK0JBtPr
Code: None
Abstract: We analyze the effect of quantizing weights and activations of neural networks on their loss and derive a simple regularization scheme that improves robustness against post-training quantization. By training quantization-ready networks, our approach enables storing a single set of weights that can be quantized on-demand to different bit-widths as energy and memory requirements of the application change. Unlike quantization-aware training using the straight-through estimator that only targets a specific bit-width and requires access to training data and pipeline, our regularization-based method paves the way for “on the fly” post-training quantization to various bit-widths. We show that by modeling quantization as an ℓ∞-bounded perturbation, the first-order term in the loss expansion can be regularized using the ℓ1-norm of gradients. We experimentally validate our method on different vision architectures on CIFAR-10 and ImageNet datasets and show that the regularization of a neural network using our method improves robustness against quantization noise.
Keyword: quantization, regularization, robustness, gradient regularization
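The regularizer itself is easy to picture: add the ℓ1 norm of the loss gradient (here taken with respect to the weights) to the task loss, which requires a double backward pass. A minimal PyTorch sketch under that assumption, with an illustrative `lam` coefficient, not the authors' code:

```python
# Hedged sketch: task loss plus lam * ||dL/dw||_1, computed with create_graph=True
# so the penalty itself can be backpropagated.
import torch
import torch.nn.functional as F

def l1_gradient_regularized_loss(model, x, y, lam=0.05):
    logits = model(x)
    task_loss = F.cross_entropy(logits, y)
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(task_loss, params, create_graph=True)
    l1_penalty = sum(g.abs().sum() for g in grads)        # first-order robustness term
    return task_loss + lam * l1_penalty
```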

SpikeGrad: An ANN-equivalent Computation Model for Implementing Backpropagation with Spikes
Author: Johannes C. Thiele, Olivier Bichler, Antoine Dupret
link: https://openreview.net/pdf?id=rkxs0yHFPH
Code: None
Abstract: Event-based neuromorphic systems promise to reduce the energy consumption of deep neural networks by replacing expensive floating point operations on dense matrices by low energy, sparse operations on spike events. While these systems can be trained increasingly well using approximations of the backpropagation algorithm, this usually requires high precision errors and is therefore incompatible with the typical communication infrastructure of neuromorphic circuits. In this work, we analyze how the gradient can be discretized into spike events when training a spiking neural network. To accelerate our simulation, we show that using a special implementation of the integrate-and-fire neuron allows us to describe the accumulated activations and errors of the spiking neural network in terms of an equivalent artificial neural network, allowing us to largely speed up training compared to an explicit simulation of all spike events. This way we are able to demonstrate that even for deep networks, the gradients can be discretized sufficiently well with spikes if the gradient is properly rescaled. This form of spike-based backpropagation enables us to achieve equivalent or better accuracies on the MNIST and CIFAR10 datasets than comparable state-of-the-art spiking neural networks trained with full precision gradients. The algorithm, which we call SpikeGrad, is based on only accumulation and comparison operations and can naturally exploit sparsity in the gradient computation, which makes it an interesting choice for spiking neuromorphic systems with on-chip learning capacities.
Keyword: spiking neural network, neuromorphic engineering, backpropagation

On the Relationship between Self-Attention and Convolutional Layers
Author: Jean-Baptiste Cordonnier, Andreas Loukas, Martin Jaggi
link: https://openreview.net/pdf?id=HJlnC1rKPB
Code: https://github.com/epfml/attention-cnn
Abstract: Recent trends of incorporating attention mechanisms in vision have led researchers to reconsider the supremacy of convolutional layers as a primary building block. Beyond helping CNNs to handle long-range dependencies, Ramachandran et al. (2019) showed that attention can completely replace convolution and achieve state-of-the-art performance on vision tasks. This raises the question: do learned attention layers operate similarly to convolutional layers? This work provides evidence that attention layers can perform convolution and, indeed, they often learn to do so in practice. Specifically, we prove that a multi-head self-attention layer with sufficient number of heads is at least as expressive as any convolutional layer. Our numerical experiments then show that self-attention layers attend to pixel-grid patterns similarly to CNN layers, corroborating our analysis. Our code is publicly available.
Keyword: self-attention, attention, transformers, convolution, CNN, image, expressivity, capacity
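The expressivity claim can be checked numerically in a toy setting: if each attention head places all of its weight on one fixed offset and its value projection equals the corresponding filter tap, summing the heads reproduces a convolution. The 1-D NumPy demo below is an illustration of that idea under those assumptions, not the paper's 2-D construction.

```python
# Toy check: 3 one-hot "attention heads" at offsets -1, 0, +1 reproduce a
# 3-tap convolution over a zero-padded sequence.
import numpy as np

L, d = 8, 4
x = np.random.randn(L, d)
kernel = np.random.randn(3, d)                        # one d-dim filter tap per offset
xp = np.vstack([np.zeros((1, d)), x, np.zeros((1, d))])

# Convolution output (single output channel).
conv = np.array([sum(xp[i + j] @ kernel[j] for j in range(3)) for i in range(L)])

# Attention: each head attends to exactly one shifted position.
heads = []
for j, offset in enumerate([-1, 0, 1]):
    A = np.zeros((L, L + 2))
    for i in range(L):
        A[i, i + 1 + offset] = 1.0                    # one-hot attention weights
    heads.append(A @ xp @ kernel[j])                  # value projection = filter tap j
attn = np.sum(heads, axis=0)

assert np.allclose(conv, attn)                        # heads summed == convolution
```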

Learning-Augmented Data Stream Algorithms
Author: Tanqiu Jiang, Yi Li, Honghao Lin, Yisong Ruan, David P. Woodruff
link: https://openreview.net/pdf?id=HyxJ1xBYDH
Code: https://drive.google.com/open?id=1faroW4fFTM7ELVkZDtgiMBjUu1F-piQa
Abstract: The data stream model is a fundamental model for processing massive data sets with limited memory and fast processing time. Recently Hsu et al. (2019) incorporated machine learning techniques into the data stream model in order to learn relevant patterns in the input data. Such techniques were encapsulated by training an oracle to predict item frequencies in the streaming model. In this paper we explore the full power of such an oracle, showing that it can be applied to a wide array of problems in data streams, sometimes resulting in the first optimal bounds for such problems. Namely, we apply the oracle to counting distinct elements on the difference of streams, estimating frequency moments, estimating cascaded aggregates, and estimating moments of geometric data streams. For the distinct elements problem, we obtain the first memory-optimal algorithms. For estimating the p-th frequency moment for 0 < p < 2 we obtain the first algorithms with optimal update time. For estimating the p-th frequency moment for p > 2 we obtain a quadratic saving in memory. We empirically validate our results, demonstrating also our improvements in practice.
Keyword: streaming algorithms, heavy hitters, F_p moment, distinct elements, cascaded norms

Structured Object-Aware Physics Prediction for Video Modeling and Planning
Author: Jannik Kossen, Karl Stelzner, Marcel Hussing, Claas Voelcker, Kristian Kersting
link: https://openreview.net/pdf?id=B1e-kxSKDH
Code: https://github.com/ICLR20/STOVE
Abstract: When humans observe a physical system, they can easily locate components, understand their interactions, and anticipate future behavior, even in settings with complicated and previously unseen interactions. For computers, however, learning such models from videos in an unsupervised fashion is an unsolved research problem. In this paper, we present STOVE, a novel state-space model for videos, which explicitly reasons about objects and their positions, velocities, and interactions. It is constructed by combining an image model and a dynamics model in a compositional manner and improves on previous work by reusing the dynamics model for inference, accelerating and regularizing training. STOVE predicts videos with convincing physical behavior over hundreds of timesteps, outperforms previous unsupervised models, and even approaches the performance of supervised baselines. We further demonstrate the strength of our model as a simulator for sample efficient model-based control, in a task with heavily interacting objects.

Keyword: self-supervised learning, probabilistic deep learning, structured models, video prediction, physics prediction, planning, variational autoencoders, model-based reinforcement learning, VAEs, unsupervised, variational, graph neural networks, tractable probabilistic models, attend-infer-repeat, relational learning, AIR, sum-product networks, object-oriented, object-centric, object-aware, MCTS

Incorporating BERT into Neural Machine Translation
Author: Jinhua Zhu, Yingce Xia, Lijun Wu, Di He, Tao Qin, Wengang Zhou, Houqiang Li, Tieyan Liu
link: https://openreview.net/pdf?id=Hyl7ygStwB
Code: https://github.com/bert-nmt/bert-nmt
Abstract: The recently proposed BERT (Devlin et al., 2019) has shown great power on a variety of natural language understanding tasks, such as text classification, reading comprehension, etc. However, how to effectively apply BERT to neural machine translation (NMT) has not been sufficiently explored. While BERT is more commonly used for fine-tuning than as a contextual embedding in downstream language understanding tasks, our preliminary exploration in NMT finds that using BERT as a contextual embedding works better than using it for fine-tuning. This motivates us to consider how to better leverage BERT for NMT along this direction. We propose a new algorithm named BERT-fused model, in which we first use BERT to extract representations for an input sequence, and then the representations are fused with each layer of the encoder and decoder of the NMT model through attention mechanisms. We conduct experiments on supervised (including sentence-level and document-level translations), semi-supervised and unsupervised machine translation, and achieve state-of-the-art results on seven benchmark datasets. Our code is available at
Keyword: BERT, Neural Machine Translation

MMA Training: Direct Input Space Margin Maximization through Adversarial Training
Author: Gavin Weiguang Ding, Yash Sharma, Kry Yik Chau Lui, Ruitong Huang
link: https://openreview.net/pdf?id=HkeryxBtPB
Code: https://github.com/BorealisAI/mma_training
Abstract: We study adversarial robustness of neural networks from a margin maximization perspective, where margins are defined as the distances from inputs to a classifier’s decision boundary.
Our study shows that maximizing margins can be achieved by minimizing the adversarial loss on the decision boundary at the “shortest successful perturbation”, demonstrating a close connection between adversarial losses and the margins. We propose Max-Margin Adversarial (MMA) training to directly maximize the margins to achieve adversarial robustness.
Instead of adversarial training with a fixed ϵ, MMA offers an improvement by enabling adaptive selection of the “correct” ϵ as the margin individually for each datapoint. In addition, we rigorously analyze adversarial training from the perspective of margin maximization, and provide an alternative interpretation for adversarial training: it maximizes either a lower or an upper bound of the margins. Our experiments empirically confirm our theory and demonstrate MMA training’s efficacy on the MNIST and CIFAR10 datasets w.r.t. ℓ∞ and ℓ2 robustness.
Keyword: adversarial robustness, perturbation, margin maximization, deep learning

Infinite-horizon Off-Policy Policy Evaluation with Multiple Behavior Policies
Author: Xinyun Chen, Lu Wang, Yizhe Hang, Heng Ge, Hongyuan Zha
link: https://openreview.net/pdf?id=rkgU1gHtvr
Code: None
Abstract: We consider off-policy policy evaluation when the trajectory data are generated by multiple behavior policies. Recent work has shown the key role played by the state or state-action stationary distribution corrections in the infinite horizon context for off-policy policy evaluation. We propose estimated mixture policy (EMP), a novel class of partially policy-agnostic methods to accurately estimate those quantities. With careful analysis, we show that EMP gives rise to estimates with reduced variance for estimating the state stationary distribution correction while it also offers a useful inductive bias for estimating the state-action stationary distribution correction. In extensive experiments with both continuous and discrete environments, we demonstrate that our algorithm offers significantly improved accuracy compared to the state-of-the-art methods.
Keyword: off-policy policy evaluation, multiple importance sampling, kernel method, variance reduction

vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations
Author: Alexei Baevski, Steffen Schneider, Michael Auli
link: https://openreview.net/pdf?id=rylwJxrYDS
Code: None
Abstract: We propose vq-wav2vec to learn discrete representations of audio segments through a wav2vec-style self-supervised context prediction task. The algorithm uses either a gumbel softmax or online k-means clustering to quantize the dense representations. Discretization enables the direct application of algorithms from the NLP community which require discrete inputs. Experiments show that BERT pre-training achieves a new state of the art on TIMIT phoneme classification and WSJ speech recognition.
Keyword: speech recognition, speech representation learning
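The Gumbel-softmax branch of the quantizer can be sketched in a few lines: dense features are projected to logits over a codebook and replaced by (straight-through) codebook entries. The shapes, single-codebook setup and temperature below are assumptions for illustration, not the paper's exact configuration (which also supports product quantization and online k-means).

```python
# Hedged sketch of Gumbel-softmax vector quantization of dense audio features.
import torch
import torch.nn.functional as F

def gumbel_quantize(dense, proj, codebook, tau=1.0):
    """dense: (B, T, D) features; proj: nn.Linear(D, V); codebook: (V, D) tensor."""
    logits = proj(dense)                                   # (B, T, V)
    onehot = F.gumbel_softmax(logits, tau=tau, hard=True)  # straight-through one-hot codes
    return onehot @ codebook                               # (B, T, D) discrete representations
```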

Meta-learning curiosity algorithms
Author: Ferran Alet*, Martin F. Schneider*, Tomas Lozano-Perez, Leslie Pack Kaelbling
link: https://openreview.net/pdf?id=BygdyxHFDS
Code: https://github.com/mfranzs/meta-learning-curiosity-algorithms
Abstract: We take the hypothesis that curiosity is a mechanism found by evolution that encourages meaningful exploration early in an agent’s life in order to expose it to experiences that enable it to obtain high rewards over the course of its lifetime. We formulate the problem of generating curious behavior as one of meta-learning: an outer loop will search over a space of curiosity mechanisms that dynamically adapt the agent’s reward signal, and an inner loop will perform standard reinforcement learning using the adapted reward signal. However, current meta-RL methods based on transferring neural network weights have only generalized between very similar tasks. To broaden the generalization, we instead propose to meta-learn algorithms: pieces of code similar to those designed by humans in ML papers. Our rich language of programs combines neural networks with other building blocks such as buffers, nearest-neighbor modules and custom loss functions. We demonstrate the effectiveness of the approach empirically, finding two novel curiosity algorithms that perform on par or better than human-designed published curiosity algorithms in domains as disparate as grid navigation with image input, acrobot, lunar lander, ant and hopper.
Keyword: meta-learning, exploration, curiosity

Making Efficient Use of Demonstrations to Solve Hard Exploration Problems
Author: Caglar Gulcehre, Tom Le Paine, Bobak Shahriari, Misha Denil, Matt Hoffman, Hubert Soyer, Richard Tanburn, Steven Kapturowski, Neil Rabinowitz, Duncan Williams, Gabriel Barth-Maron, Ziyu Wang, Nando de Freitas, Worlds Team
link: https://openreview.net/pdf?id=SygKyeHKDH
Code: None
Abstract: This paper introduces R2D3, an agent that makes efficient use of demonstrations to solve hard exploration problems in partially observable environments with highly variable initial conditions. We also introduce a suite of eight tasks that combine these three properties, and show that R2D3 can solve several of the tasks where other state of the art methods (both with and without demonstrations) fail to see even a single successful trajectory after tens of billions of steps of exploration.
Keyword: imitation learning, deep learning, reinforcement learning

VariBAD: A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning
Author: Luisa Zintgraf, Kyriacos Shiarlis, Maximilian Igl, Sebastian Schulze, Yarin Gal, Katja Hofmann, Shimon Whiteson
link: https://openreview.net/pdf?id=Hkl9JlBYvr
Code: None
Abstract: Trading off exploration and exploitation in an unknown environment is key to maximising expected return during learning. A Bayes-optimal policy, which does so optimally, conditions its actions not only on the environment state but on the agent’s uncertainty about the environment. Computing a Bayes-optimal policy is however intractable for all but the smallest tasks. In this paper, we introduce variational Bayes-Adaptive Deep RL (variBAD), a way to meta-learn to perform approximate inference in an unknown environment, and incorporate task uncertainty directly during action selection. In a grid-world domain, we illustrate how variBAD performs structured online exploration as a function of task uncertainty. We further evaluate variBAD on MuJoCo domains widely used in meta-RL and show that it achieves higher online return than existing methods.
Keyword: Meta-Learning, Bayesian Reinforcement Learning, BAMDPs, Deep Reinforcement Learning

Lookahead: A Far-sighted Alternative of Magnitude-based Pruning
Author: Sejun Park*, Jaeho Lee*, Sangwoo Mo, Jinwoo Shin
link: https://openreview.net/pdf?id=ryl3ygHYDB
Code: https://github.com/alinlab/lookahead_pruning
Abstract: Magnitude-based pruning is one of the simplest methods for pruning neural networks. Despite its simplicity, magnitude-based pruning and its variants demonstrated remarkable performances for pruning modern architectures. Based on the observation that magnitude-based pruning indeed minimizes the Frobenius distortion of a linear operator corresponding to a single layer, we develop a simple pruning method, coined lookahead pruning, by extending the single layer optimization to a multi-layer optimization. Our experimental results demonstrate that the proposed method consistently outperforms magnitude-based pruning on various networks, including VGG and ResNet, particularly in the high-sparsity regime. See
Keyword: network magnitude-based pruning

Spike-based causal inference for weight alignment
Author: Jordan Guerguiev, Konrad Kording, Blake Richards
link: https://openreview.net/pdf?id=rJxWxxSYvB
Code: https://anonfile.com/51V8Ge66n3/Code_zip
Abstract: In artificial neural networks trained with gradient descent, the weights used for processing stimuli are also used during backward passes to calculate gradients. For the real brain to approximate gradients, gradient information would have to be propagated separately, such that one set of synaptic weights is used for processing and another set is used for backward passes. This produces the so-called “weight transport problem” for biological models of learning, where the backward weights used to calculate gradients need to mirror the forward weights used to process stimuli. This weight transport problem has been considered so hard that popular proposals for biological learning assume that the backward weights are simply random, as in the feedback alignment algorithm. However, such random weights do not appear to work well for large networks. Here we show how the discontinuity introduced in a spiking system can lead to a solution to this problem. The resulting algorithm is a special case of an estimator used for causal inference in econometrics, regression discontinuity design. We show empirically that this algorithm rapidly makes the backward weights approximate the forward weights. As the backward weights become correct, this improves learning performance over feedback alignment on tasks such as Fashion-MNIST and CIFAR-10. Our results demonstrate that a simple learning rule in a spiking network can allow neurons to produce the right backward connections and thus solve the weight transport problem.
Keyword: causal, inference, weight, transport, rdd, regression, discontinuity, design, cifar10, biologically, plausible

Empirical Bayes Transductive Meta-Learning with Synthetic Gradients
Author: Shell Xu Hu, Pablo Moreno, Yang Xiao, Xi Shen, Guillaume Obozinski, Neil Lawrence, Andreas Damianou
link: https://openreview.net/pdf?id=Hkg-xgrYvH
Code: https://github.com/hushell/sib_meta_learn
Abstract: We propose a meta-learning approach that learns from multiple tasks in a transductive setting, by leveraging the unlabeled query set in addition to the support set to generate a more powerful model for each task. To develop our framework, we revisit the empirical Bayes formulation for multi-task learning. The evidence lower bound of the marginal log-likelihood of empirical Bayes decomposes as a sum of local KL divergences between the variational posterior and the true posterior on the query set of each task.
We derive a novel amortized variational inference that couples all the variational posteriors via a meta-model, which consists of a synthetic gradient network and an initialization network. Each variational posterior is derived from synthetic gradient descent to approximate the true posterior on the query set, even though we do not have access to the true gradient.
Our results on the Mini-ImageNet and CIFAR-FS benchmarks for episodic few-shot classification outperform previous state-of-the-art methods. Besides, we conduct two zero-shot learning experiments to further explore the potential of the synthetic gradient.
Keyword: Meta-learning, Empirical Bayes, Synthetic Gradient, Information Bottleneck

Keep Doing What Worked: Behavior Modelling Priors for Offline Reinforcement Learning
Author: Noah Siegel, Jost Tobias Springenberg, Felix Berkenkamp, Abbas Abdolmaleki, Michael Neunert, Thomas Lampe, Roland Hafner, Nicolas Heess, Martin Riedmiller
link: https://openreview.net/pdf?id=rke7geHtwH
Code: None
Abstract: Off-policy reinforcement learning algorithms promise to be applicable in settings where only a fixed data-set (batch) of environment interactions is available and no new experience can be acquired. This property makes these algorithms appealing for real world problems such as robot control. In practice, however, standard off-policy algorithms fail in the batch setting for continuous control. In this paper, we propose a simple solution to this problem. It admits the use of data generated by arbitrary behavior policies and uses a learned prior – the advantage-weighted behavior model (ABM) – to bias the RL policy towards actions that have previously been executed and are likely to be successful on the new task. Our method can be seen as an extension of recent work on batch-RL that enables stable learning from conflicting data-sources. We find improvements on competitive baselines in a variety of RL tasks – including standard continuous control benchmarks and multi-task learning for simulated and real-world robots.
Keyword: Reinforcement Learning, Off-policy, Multitask, Continuous Control

Understanding the Limitations of Conditional Generative Models
Author: Ethan Fetaya, Joern-Henrik Jacobsen, Will Grathwohl, Richard Zemel
link: https://openreview.net/pdf?id=r1lPleBFvH
Code: None
Abstract: Class-conditional generative models hold promise to overcome the shortcomings of their discriminative counterparts. They are a natural choice to solve discriminative tasks in a robust manner as they jointly optimize for predictive performance and accurate modeling of the input distribution. In this work, we investigate robust classification with likelihood-based generative models from a theoretical and practical perspective to investigate if they can deliver on their promises. Our analysis focuses on a spectrum of robustness properties: (1) Detection of worst-case outliers in the form of adversarial examples; (2) Detection of average-case outliers in the form of ambiguous inputs and (3) Detection of incorrectly labeled in-distribution inputs.

Our theoretical result reveals that it is impossible to guarantee detectability of adversarially-perturbed inputs even for near-optimal generative classifiers. Experimentally, we find that while we are able to train robust models for MNIST, robustness completely breaks down on CIFAR10. We relate this failure to various undesirable model properties that can be traced to the maximum likelihood training objective. Despite being a common choice in the literature, our results indicate that likelihood-based conditional generative models may be surprisingly ineffective for robust classification.

Keyword: Conditional Generative Models, Generative Classifiers, Robustness, Adversarial Examples

Demystifying Inter-Class Disentanglement
Author: Aviv Gabbay, Yedid Hoshen
link: https://openreview.net/pdf?id=Hyl9xxHYPr
Code: https://github.com/avivga/lord
Abstract: Learning to disentangle the hidden factors of variations within a set of observations is a key task for artificial intelligence. We present a unified formulation for class and content disentanglement and use it to illustrate the limitations of current methods. We therefore introduce LORD, a novel method based on Latent Optimization for Representation Disentanglement. We find that latent optimization, along with an asymmetric noise regularization, is superior to amortized inference for achieving disentangled representations. In extensive experiments, our method is shown to achieve better disentanglement performance than both adversarial and non-adversarial methods that use the same level of supervision. We further introduce a clustering-based approach for extending our method for settings that exhibit in-class variation with promising results on the task of domain translation.
Keyword: disentanglement, latent optimization, domain translation

Mixed-curvature Variational Autoencoders
Author: Ondrej Skopek, Octavian-Eugen Ganea, Gary Bécigneul
link: https://openreview.net/pdf?id=S1g6xeSKDS
Code: https://github.com/oskopek/mvae
Abstract: Euclidean space has historically been the typical workhorse geometry for machine learning applications due to its power and simplicity. However, it has recently been shown that geometric spaces with constant non-zero curvature improve representations and performance on a variety of data types and downstream tasks. Consequently, generative models like Variational Autoencoders (VAEs) have been successfully generalized to elliptical and hyperbolic latent spaces. While these approaches work well on data with particular kinds of biases e.g. tree-like data for a hyperbolic VAE, there exists no generic approach unifying and leveraging all three models. We develop a Mixed-curvature Variational Autoencoder, an efficient way to train a VAE whose latent space is a product of constant curvature Riemannian manifolds, where the per-component curvature is fixed or learnable. This generalizes the Euclidean VAE to curved latent spaces and recovers it when curvatures of all latent space components go to 0.
Keyword: variational autoencoders, riemannian manifolds, non-Euclidean geometry

BinaryDuo: Reducing Gradient Mismatch in Binary Activation Network by Coupling Binary Activations
Author: Hyungjun Kim, Kyungsu Kim, Jinseok Kim, Jae-Joon Kim
link: https://openreview.net/pdf?id=r1x0lxrFPS
Code: https://github.com/Hyungjun-K1m/BinaryDuo
Abstract: Binary Neural Networks (BNNs) have been garnering interest thanks to their compute cost reduction and memory savings. However, BNNs suffer from performance degradation mainly due to the gradient mismatch caused by binarizing activations. Previous works tried to address the gradient mismatch problem by reducing the discrepancy between activation functions used at forward pass and its differentiable approximation used at backward pass, which is an indirect measure. In this work, we use the gradient of smoothed loss function to better estimate the gradient mismatch in quantized neural network. Analysis using the gradient mismatch estimator indicates that using higher precision for activation is more effective than modifying the differentiable approximation of activation function. Based on the observation, we propose a new training scheme for binary activation networks called BinaryDuo in which two binary activations are coupled into a ternary activation during training. Experimental results show that BinaryDuo outperforms state-of-the-art BNNs on various benchmarks with the same amount of parameters and computing cost.
Keyword: None

Model-based reinforcement learning for biological sequence design
Author: Christof Angermueller, David Dohan, David Belanger, Ramya Deshpande, Kevin Murphy, Lucy Colwell
link: https://openreview.net/pdf?id=HklxbgBKvr
Code: None
Abstract: The ability to design biological structures such as DNA or proteins would have considerable medical and industrial impact. Doing so presents a challenging black-box optimization problem characterized by the large-batch, low-round setting due to the need for labor-intensive wet lab evaluations. In response, we propose using reinforcement learning (RL) based on proximal-policy optimization (PPO) for biological sequence design. RL provides a flexible framework for optimizing generative sequence models to achieve specific criteria, such as diversity among the high-quality sequences discovered. We propose a model-based variant of PPO, DyNA-PPO, to improve sample efficiency, where the policy for a new round is trained offline using a simulator fit on functional measurements from prior rounds. To accommodate the growing number of observations across rounds, the simulator model is automatically selected at each round from a pool of diverse models of varying capacity. On the tasks of designing DNA transcription factor binding sites, designing antimicrobial proteins, and optimizing the energy of Ising models based on protein structure, we find that DyNA-PPO performs significantly better than existing methods in settings in which modeling is feasible, while still not performing worse in situations in which a reliable model cannot be learned.
Keyword: reinforcement learning, blackbox optimization, molecule design

BayesOpt Adversarial Attack
Author: Binxin Ru, Adam Cobb, Arno Blaas, Yarin Gal
link: https://openreview.net/pdf?id=Hkem-lrtvH
Code: https://github.com/rubinxin/BayesOpt_Attack
Abstract: Black-box adversarial attacks require a large number of attempts before finding successful adversarial examples that are visually indistinguishable from the original input. Current approaches relying on substitute model training, gradient estimation or genetic algorithms often require an excessive number of queries. Therefore, they are not suitable for real-world systems where the maximum query number is limited due to cost. We propose a query-efficient black-box attack which uses Bayesian optimisation in combination with Bayesian model selection to optimise over the adversarial perturbation and the optimal degree of search space dimension reduction. We demonstrate empirically that our method can achieve comparable success rates with 2-5 times fewer queries compared to previous state-of-the-art black-box attacks.
Keyword: Black-box Adversarial Attack, Bayesian Optimisation, Gaussian Process

Meta Reinforcement Learning with Autonomous Inference of Subtask Dependencies
Author: Sungryull Sohn, Hyunjae Woo, Jongwook Choi, Honglak Lee
link: https://openreview.net/pdf?id=HkgsWxrtPB
Code: None
Abstract: We propose and address a novel few-shot RL problem, where a task is characterized by a subtask graph which describes a set of subtasks and their dependencies that are unknown to the agent. The agent needs to quickly adapt to the task over a few episodes during the adaptation phase to maximize the return in the test phase. Instead of directly learning a meta-policy, we develop a Meta-learner with Subtask Graph Inference (MSGI), which infers the latent parameter of the task by interacting with the environment and maximizes the return given the latent parameter. To facilitate learning, we adopt an intrinsic reward inspired by upper confidence bound (UCB) that encourages efficient exploration. Our experimental results on two grid-world domains and StarCraft II environments show that the proposed method is able to accurately infer the latent task parameter, and to adapt more efficiently than existing meta RL and hierarchical RL methods.
Keyword: Meta reinforcement learning, subtask graph

Hypermodels for Exploration
Author: Vikranth Dwaracherla, Xiuyuan Lu, Morteza Ibrahimi, Ian Osband, Zheng Wen, Benjamin Van Roy
link: https://openreview.net/pdf?id=ryx6WgStPB
Code: None
Abstract: We study the use of hypermodels to represent epistemic uncertainty and guide exploration.
This generalizes and extends the use of ensembles to approximate Thompson sampling. The computational cost of training an ensemble grows with its size, and as such, prior work has typically been limited to ensembles with tens of elements. We show that alternative hypermodels can enjoy dramatic efficiency gains, enabling behavior that would otherwise require hundreds or thousands of elements, and even succeed in situations where ensemble methods fail to learn regardless of size.
This allows more accurate approximation of Thompson sampling as well as use of more sophisticated exploration schemes. In particular, we consider an approximate form of information-directed sampling and demonstrate performance gains relative to Thompson sampling. As alternatives to ensembles, we consider linear and neural network hypermodels, also known as hypernetworks.
We prove that, with neural network base models, a linear hypermodel can represent essentially any distribution over functions, and as such, hypernetworks do not extend what can be represented.
Keyword: exploration, hypermodel, reinforcement learning
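A linear hypermodel of the kind discussed above is simple to write down: a learned affine map takes a random index vector to a full parameter vector of the base network, and sampling an index plays the role of sampling from an approximate posterior for Thompson-sampling-style exploration. The dimensions and initialization in this sketch are illustrative assumptions, not the paper's configuration.

```python
# Hedged sketch of a linear hypermodel over a base network's flattened weights.
import torch
import torch.nn as nn

class LinearHypermodel(nn.Module):
    def __init__(self, n_base_params, index_dim=16):
        super().__init__()
        self.A = nn.Parameter(0.01 * torch.randn(n_base_params, index_dim))
        self.b = nn.Parameter(torch.zeros(n_base_params))

    def sample_params(self, n_samples=1):
        z = torch.randn(n_samples, self.A.shape[1])   # index from the reference distribution
        return z @ self.A.T + self.b                  # one base-network weight vector per sample
```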

RaPP: Novelty Detection with Reconstruction along Projection Pathway
Author: Ki Hyun Kim, Sangwoo Shim, Yongsub Lim, Jongseob Jeon, Jeongwoo Choi, Byungchan Kim, Andre S. Yoon
link: https://openreview.net/pdf?id=HkgeGeBYDB
Code: https://drive.google.com/drive/folders/1sknl_i4zmvSsPYZdzYxbg66ZSYDZ_abg?usp=sharing
Abstract: We propose RaPP, a new methodology for novelty detection by utilizing hidden space activation values obtained from a deep autoencoder.
Precisely, RaPP compares input and its autoencoder reconstruction not only in the input space but also in the hidden spaces.
We show that if we feed a reconstructed input to the same autoencoder again, its activated values in a hidden space are equivalent to the corresponding reconstruction in that hidden space given the original input.
In order to aggregate the hidden space activation values, we propose two metrics, which enhance the novelty detection performance.
Through extensive experiments using diverse datasets, we validate that RaPP improves novelty detection performances of autoencoder-based approaches.
Besides, we show that RaPP outperforms recent novelty detection methods evaluated on popular benchmarks.

Keyword: Novelty Detection, Anomaly Detection, Outlier Detection, Semi-supervised Learning
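A minimal sketch of the idea, assuming a PyTorch autoencoder whose encoder is exposed as a list of layers (not the authors' code): feed both the input and its reconstruction through the encoder, and aggregate the layer-wise differences into a novelty score. The simple sum-of-squares aggregation below stands in for the paper's proposed metrics.

```python
# Hedged sketch of a RaPP-style novelty score along the encoder's projection pathway.
import torch

def rapp_style_score(x, autoencoder, encoder_layers):
    with torch.no_grad():
        h, h_hat = x, autoencoder(x)                  # input and its reconstruction
        score = 0.0
        for layer in encoder_layers:                  # walk the hidden spaces
            h, h_hat = layer(h), layer(h_hat)
            score = score + (h - h_hat).pow(2).flatten(1).sum(dim=1)
    return score                                      # larger => more novel
```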

Dynamics-Aware Embeddings
Author: William Whitney, Rajat Agarwal, Kyunghyun Cho, Abhinav Gupta
link: https://openreview.net/pdf?id=BJgZGeHFPH
Code: https://github.com/dyne-submission/dynamics-aware-embeddings
Abstract: In this paper we consider self-supervised representation learning to improve sample efficiency in reinforcement learning (RL). We propose a forward prediction objective for simultaneously learning embeddings of states and actions. These embeddings capture the structure of the environment’s dynamics, enabling efficient policy learning. We demonstrate that our action embeddings alone improve the sample efficiency and peak performance of model-free RL on control from low-dimensional states. By combining state and action embeddings, we achieve efficient learning of high-quality policies on goal-conditioned continuous control from pixel observations in only 1-2 million environment steps.
Keyword: representation learning, reinforcement learning, rl

Functional Regularisation for Continual Learning with Gaussian Processes
Author: Michalis K. Titsias, Jonathan Schwarz, Alexander G. de G. Matthews, Razvan Pascanu, Yee Whye Teh
link: https://openreview.net/pdf?id=HkxCzeHFDB
Code: None
Abstract: We introduce a framework for Continual Learning (CL) based on Bayesian inference over the function space rather than the parameters of a deep neural network. This method, referred to as functional regularisation for Continual Learning, avoids forgetting a previous task by constructing and memorising an approximate posterior belief over the underlying task-specific function. To achieve this we rely on a Gaussian process obtained by treating the weights of the last layer of a neural network as random and Gaussian distributed. Then, the training algorithm sequentially encounters tasks and constructs posterior beliefs over the task-specific functions by using inducing point sparse Gaussian process methods. At each step a new task is first learnt and then a summary is constructed consisting of (i) inducing inputs – a fixed-size subset of the task inputs selected such that it optimally represents the task – and (ii) a posterior distribution over the function values at these inputs. This summary then regularises learning of future tasks, through Kullback-Leibler regularisation terms. Our method thus unites approaches focused on (pseudo-)rehearsal with those derived from a sequential Bayesian inference perspective in a principled way, leading to strong results on accepted benchmarks.
Keyword: Continual Learning, Gaussian Processes, Lifelong learning, Incremental Learning

You CAN Teach an Old Dog New Tricks! On Training Knowledge Graph Embeddings
Author: Daniel Ruffinelli, Samuel Broscheit, Rainer Gemulla
link: https://openreview.net/pdf?id=BkxSmlBFvr
Code: None
Abstract: Knowledge graph embedding (KGE) models learn algebraic representations of the entities and relations in a knowledge graph. A vast number of KGE techniques for multi-relational link prediction have been proposed in the recent literature, often with state-of-the-art performance. These approaches differ along a number of dimensions, including different model architectures, different training strategies, and different approaches to hyperparameter optimization. In this paper, we take a step back and aim to summarize and quantify empirically the impact of each of these dimensions on model performance. We report on the results of an extensive experimental study with popular model architectures and training strategies across a wide range of hyperparameter settings. We found that when trained appropriately, the relative performance differences between various model architectures often shrink and sometimes even reverse when compared to prior results. For example, RESCAL (Nickel et al., 2011), one of the first KGE models, showed strong performance when trained with state-of-the-art techniques; it was competitive to or outperformed more recent architectures. We also found that good (and often superior to prior studies) model configurations can be found by exploring relatively few random samples from a large hyperparameter space. Our results suggest that many of the more advanced architectures and techniques proposed in the literature should be revisited to reassess their individual benefits. To foster further reproducible research, we provide all our implementations and experimental results as part of the open source LibKGE framework.
Keyword: knowledge graph embeddings, hyperparameter optimization

AdvectiveNet: An Eulerian-Lagrangian Fluidic Reservoir for Point Cloud Processing
Author: Xingzhe He, Helen Lu Cao, Bo Zhu
link: https://openreview.net/pdf?id=H1eqQeHFDS
Code: https://github.com/DIUDIUDIUDIUDIU/AdvectiveNet-An-Eulerian-Lagrangian-Fluidic-Reservoir-for-Point-Cloud-Processing
Abstract: This paper presents a novel physics-inspired deep learning approach for point cloud processing motivated by the natural flow phenomena in fluid mechanics. Our learning architecture jointly defines data in an Eulerian world space, using a static background grid, and a Lagrangian material space, using moving particles. By introducing this Eulerian-Lagrangian representation, we are able to naturally evolve and accumulate particle features using flow velocities generated from a generalized, high-dimensional force field. We demonstrate the efficacy of this system by solving various point cloud classification and segmentation problems with state-of-the-art performance. The entire geometric reservoir and data flow mimic the pipeline of the classic PIC/FLIP scheme in modeling natural flow, bridging the disciplines of geometric machine learning and physical simulation.
Keyword: Point Cloud Processing, Physical Reservoir Learning, Eulerian-Lagrangian Method, PIC/FLIP

Never Give Up: Learning Directed Exploration Strategies
Author: Adrià Puigdomènech Badia, Pablo Sprechmann, Alex Vitvitskyi, Daniel Guo, Bilal Piot, Steven Kapturowski, Olivier Tieleman, Martin Arjovsky, Alexander Pritzel, Andrew Bolt, Charles Blundell
link: https://openreview.net/pdf?id=Sye57xStvB
Code: None
Abstract: We propose a reinforcement learning agent to solve hard exploration games by learning a range of directed exploratory policies. We construct an episodic memory-based intrinsic reward using k-nearest neighbors over the agent’s recent experience to train the directed exploratory policies, thereby encouraging the agent to repeatedly revisit all states in its environment. A self-supervised inverse dynamics model is used to train the embeddings of the nearest neighbour lookup, biasing the novelty signal towards what the agent can control. We employ the framework of Universal Value Function Approximators to simultaneously learn many directed exploration policies with the same neural network, with different trade-offs between exploration and exploitation. By using the same neural network for different degrees of exploration/exploitation, transfer is demonstrated from predominantly exploratory policies yielding effective exploitative policies. The proposed method can be incorporated into modern distributed RL agents that collect large amounts of experience from many actors running in parallel on separate environment instances. Our method doubles the performance of the base agent in all hard exploration games in the Atari-57 suite while maintaining a very high score across the remaining games, obtaining a median human normalised score of 1344.0%. Notably, the proposed method is the first algorithm to achieve non-zero rewards (with a mean score of 8,400) in the game of Pitfall! without using demonstrations or hand-crafted features.
Keyword: deep reinforcement learning, exploration, intrinsic motivation
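The episodic novelty bonus can be approximated in a few lines of NumPy: embed the current state, look up its k nearest neighbours in the episode's memory of embeddings, and turn the kernelized distances into a reward. The kernel form and constants below are rough assumptions rather than the paper's exact formula.

```python
# Hedged sketch of a kNN-based episodic intrinsic reward over controllable-state embeddings.
import numpy as np

def episodic_intrinsic_reward(embedding, memory, k=10, eps=1e-3):
    if len(memory) == 0:
        return 1.0
    d2 = np.sum((np.asarray(memory) - embedding) ** 2, axis=1)   # squared distances
    knn = np.sort(d2)[:k]
    kernel = eps / (knn / (knn.mean() + 1e-8) + eps)             # inverse-distance kernel
    return 1.0 / np.sqrt(kernel.sum() + 1e-8)                    # rarely visited => larger reward
```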

Fair Resource Allocation in Federated Learning
Author: Tian Li, Maziar Sanjabi, Ahmad Beirami, Virginia Smith
link: https://openreview.net/pdf?id=ByexElSYDr
Code: https://github.com/litian96/fair_flearn
Abstract: Federated learning involves training statistical models in massive, heterogeneous networks. Naively minimizing an aggregate loss function in such a network may disproportionately advantage or disadvantage some of the devices. In this work, we propose q-Fair Federated Learning (q-FFL), a novel optimization objective inspired by fair resource allocation in wireless networks that encourages a more fair (specifically, a more uniform) accuracy distribution across devices in federated networks. To solve q-FFL, we devise a communication-efficient method, q-FedAvg, that is suited to federated networks. We validate both the effectiveness of q-FFL and the efficiency of q-FedAvg on a suite of federated datasets with both convex and non-convex models, and show that q-FFL (along with q-FedAvg) outperforms existing baselines in terms of the resulting fairness, flexibility, and efficiency.
Keyword: federated learning, fairness, distributed optimization
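The q-FFL objective itself is compact: per-device losses are raised to the power q+1 before being aggregated, so a larger q shifts weight toward the worst-off devices, while q = 0 recovers the usual weighted-average objective. A small sketch of that aggregation, with the device weights and losses assumed given; this is only the objective, not the communication-efficient q-FedAvg solver.

```python
# Hedged sketch of the q-fair aggregate objective f_q(w) = sum_k p_k * F_k(w)^(q+1) / (q+1).
def q_fair_objective(device_losses, device_weights, q=1.0):
    assert len(device_losses) == len(device_weights)
    return sum(p * loss ** (q + 1) / (q + 1)
               for p, loss in zip(device_weights, device_losses))
```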

Smooth markets: A basic mechanism for organizing gradient-based learners
Author: David Balduzzi, Wojciech M. Czarnecki, Tom Anthony, Ian Gemp, Edward Hughes, Joel Leibo, Georgios Piliouras, Thore Graepel
link: https://openreview.net/pdf?id=B1xMEerYvB
Code: None
Abstract: With the success of modern machine learning, it is becoming increasingly important to understand and control how learning algorithms interact. Unfortunately, negative results from game theory show there is little hope of understanding or controlling general n-player games. We therefore introduce smooth markets (SM-games), a class of n-player games with pairwise zero sum interactions. SM-games codify a common design pattern in machine learning that includes some GANs, adversarial training, and other recent algorithms. We show that SM-games are amenable to analysis and optimization using first-order methods.
Keyword: game theory, optimization, gradient descent, adversarial learning

StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding
Author: Wei Wang, Bin Bi, Ming Yan, Chen Wu, Jiangnan Xia, Zuyi Bao, Liwei Peng, Luo Si
link: https://openreview.net/pdf?id=BJgQ4lSFPH
Code: None
Abstract: Recently, the pre-trained language model, BERT (and its robustly optimized version RoBERTa), has attracted a lot of attention in natural language understanding (NLU), and achieved state-of-the-art accuracy in various NLU tasks, such as sentiment classification, natural language inference, semantic textual similarity and question answering. Inspired by the linearization exploration work of Elman, we extend BERT to a new model, StructBERT, by incorporating language structures into pre-training. Specifically, we pre-train StructBERT with two auxiliary tasks to make the most of the sequential order of words and sentences, which leverage language structures at the word and sentence levels, respectively. As a result, the new model is adapted to different levels of language understanding required by downstream tasks.

The StructBERT with structural pre-training gives surprisingly good empirical results on a variety of downstream tasks, including pushing the state-of-the-art on the GLUE benchmark to 89.0 (outperforming all published models at the time of model submission), the F1 score on SQuAD v1.1 question answering to 93.0, and the accuracy on SNLI to 91.7.

Keyword: None

Training binary neural networks with real-to-binary convolutions
Author: Brais Martinez, Jing Yang, Adrian Bulat, Georgios Tzimiropoulos
link: https://openreview.net/pdf?id=BJg4NgBKvH
Code: None
Abstract: This paper shows how to train binary networks to within a few percentage points (~3-5%) of their full-precision counterparts. We first show how to build a strong baseline, which already achieves state-of-the-art accuracy, by combining recently proposed advances and carefully adjusting the optimization procedure. Secondly, we show that by attempting to minimize the discrepancy between the output of the binary and the corresponding real-valued convolution, additional significant accuracy gains can be obtained. We materialize this idea in two complementary ways: (1) with a loss function, during training, by matching the spatial attention maps computed at the output of the binary and real-valued convolutions, and (2) in a data-driven manner, by using the real-valued activations, available during inference prior to the binarization process, for re-scaling the activations right after the binary convolution. Finally, we show that, when putting all of our improvements together, the proposed model beats the current state of the art by more than 5% top-1 accuracy on ImageNet and reduces the gap to its real-valued counterpart to less than 3% and 5% top-1 accuracy on CIFAR-100 and ImageNet respectively when using a ResNet-18 architecture. Code available at
Keyword: binary networks

Permutation Equivariant Models for Compositional Generalization in Language
Author: Jonathan Gordon, David Lopez-Paz, Marco Baroni, Diane Bouchacourt
link: https://openreview.net/pdf?id=SylVNerFvr
Code: https://github.com/facebookresearch/Permutation-Equivariant-Seq2Seq
Abstract: Humans understand novel sentences by composing meanings and roles of core language components. In contrast, neural network models for natural language modeling fail when such compositional generalization is required. The main contribution of this paper is to hypothesize that language compositionality is a form of group-equivariance. Based on this hypothesis, we propose a set of tools for constructing equivariant sequence-to-sequence models. Throughout a variety of experiments on the SCAN tasks, we analyze the behavior of existing models under the lens of equivariance, and demonstrate that our equivariant architecture is able to achieve the type of compositional generalization required in human language understanding.
Keyword: Compositionality, Permutation Equivariance, Language Processing

Continual learning with hypernetworks
Author: Johannes von Oswald, Christian Henning, João Sacramento, Benjamin F. Grewe
link: https://openreview.net/pdf?id=SJgwNerKvB
Code: https://github.com/chrhenning/hypercl
Abstract: Artificial neural networks suffer from catastrophic forgetting when they are sequentially trained on multiple tasks. To overcome this problem, we present a novel approach based on task-conditioned hypernetworks, i.e., networks that generate the weights of a target model based on task identity. Continual learning (CL) is less difficult for this class of models thanks to a simple key feature: instead of recalling the input-output relations of all previously seen data, task-conditioned hypernetworks only require rehearsing task-specific weight realizations, which can be maintained in memory using a simple regularizer. Besides achieving state-of-the-art performance on standard CL benchmarks, additional experiments on long task sequences reveal that task-conditioned hypernetworks display a very large capacity to retain previous memories. Notably, such long memory lifetimes are achieved in a compressive regime, when the number of trainable hypernetwork weights is comparable or smaller than target network size. We provide insight into the structure of low-dimensional task embedding spaces (the input space of the hypernetwork) and show that task-conditioned hypernetworks demonstrate transfer learning. Finally, forward information transfer is further supported by empirical results on a challenging CL benchmark based on the CIFAR-10/100 image datasets.
Keyword: Continual Learning, Catastrophic Forgetting, Meta Model, Hypernetwork
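The rehearsal-free regularizer mentioned in the abstract can be sketched as follows: while training on a new task, the hypernetwork's output for each previously learned task embedding is pulled back toward the weight realization stored when that task was learned. This simplified version omits the paper's lookahead formulation; `hypernet(task_emb)` is assumed to return a flat weight vector for the target network.

```python
# Hedged sketch of a task-conditioned hypernetwork output regularizer.
import torch

def hypernet_cl_regularizer(hypernet, old_task_embeddings, stored_weights, beta=0.01):
    reg = 0.0
    for emb, w_old in zip(old_task_embeddings, stored_weights):
        reg = reg + ((hypernet(emb) - w_old) ** 2).sum()   # keep old tasks' weights reproducible
    return beta * reg
```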

Phase Transitions for the Information Bottleneck in Representation Learning
Author: Tailin Wu, Ian Fischer
link: https://openreview.net/pdf?id=HJloElBYvB
Code: None
Abstract: In the Information Bottleneck (IB), when tuning the relative strength between compression and prediction terms, how do the two terms behave, and what’s their relationship with the dataset and the learned representation? In this paper, we set out to answer these questions by studying multiple phase transitions in the IB objective: IB_β[p(z|x)] = I(X; Z) − βI(Y; Z) defined on the encoding distribution p(z|x) for input X, target Y and representation Z, where sudden jumps of dI(Y; Z)/dβ and prediction accuracy are observed with increasing β. We introduce a definition for IB phase transitions as a qualitative change of the IB loss landscape, and show that the transitions correspond to the onset of learning new classes. Using second-order calculus of variations, we derive a formula that provides a practical condition for IB phase transitions, and draw its connection with the Fisher information matrix for parameterized models. We provide two perspectives to understand the formula, revealing that each IB phase transition is finding a component of maximum (nonlinear) correlation between X and Y orthogonal to the learned representation, in close analogy with canonical-correlation analysis (CCA) in linear settings. Based on the theory, we present an algorithm for discovering phase transition points. Finally, we verify that our theory and algorithm accurately predict phase transitions in categorical datasets, predict the onset of learning new classes and class difficulty in MNIST, and predict prominent phase transitions in CIFAR10.

Keyword: Information Theory, Representation Learning, Phase Transition

Variational Template Machine for Data-to-Text Generation
Author: Rong Ye, Wenxian Shi, Hao Zhou, Zhongyu Wei, Lei Li
link: https://openreview.net/pdf?id=HkejNgBtPB
Code: None
Abstract: How to generate descriptions from structured data organized in tables? Existing approaches using neural encoder-decoder models often suffer from lacking diversity. We claim that an open set of templates is crucial for enriching the phrase constructions and realizing varied generations. Learning such templates is prohibitive since it often requires a large paired corpus of tables and descriptions, which is seldom available. This paper explores the problem of automatically learning reusable “templates” from paired and non-paired data. We propose the variational template machine (VTM), a novel method to generate text descriptions from data tables. Our contributions include: a) we carefully devise a specific model architecture and losses to explicitly disentangle text template and semantic content information in the latent spaces, and b) we utilize both small parallel data and large raw text without aligned tables to enrich the template learning. Experiments on datasets from a variety of different domains show that VTM is able to generate more diverse descriptions while maintaining good fluency and quality.
Keyword: None

Memory-Based Graph Networks
Author: Amir Hosein Khasahmadi, Kaveh Hassani, Parsa Moradi, Leo Lee, Quaid Morris
link: https://openreview.net/pdf?id=r1laNeBYPB
Code: https://github.com/amirkhas/GraphMemoryNet
Abstract: Graph neural networks (GNNs) are a class of deep models that operate on data with arbitrary topology represented as graphs. We introduce an efficient memory layer for GNNs that can jointly learn node representations and coarsen the graph. We also introduce two new networks based on this layer: memory-based GNN (MemGNN) and graph memory network (GMN) that can learn hierarchical graph representations. The experimental results show that the proposed models achieve state-of-the-art results in eight out of nine graph classification and regression benchmarks. We also show that the learned representations could correspond to chemical features in the molecule data.

Keyword: Graph Neural Networks, Memory Networks, Hierarchical Graph Representation Learning

AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty
Author: Dan Hendrycks*, Norman Mu*, Ekin Dogus Cubuk, Barret Zoph, Justin Gilmer, Balaji Lakshminarayanan
link: https://openreview.net/pdf?id=S1gmrxHFvB
Code: https://github.com/google-research/augmix
Abstract: Modern deep neural networks can achieve high accuracy when the training distribution and test distribution are identically distributed, but this assumption is frequently violated in practice. When the train and test distributions are mismatched, accuracy can plummet. Currently there are few techniques that improve robustness to unforeseen data shifts encountered during deployment. In this work, we propose a technique to improve the robustness and uncertainty estimates of image classifiers. We propose AugMix, a data processing technique that is simple to implement, adds limited computational overhead, and helps models withstand unforeseen corruptions. AugMix significantly improves robustness and uncertainty measures on challenging image classification benchmarks, closing the gap between previous methods and the best possible performance in some cases by more than half.
Keyword: robustness, uncertainty

AtomNAS: Fine-Grained End-to-End Neural Architecture Search
Author: Jieru Mei, Yingwei Li, Xiaochen Lian, Xiaojie Jin, Linjie Yang, Alan Yuille, Jianchao Yang
link: https://openreview.net/pdf?id=BylQSxHFwr
Code: https://github.com/meijieru/AtomNAS
Abstract: Search space design is critical to neural architecture search (NAS) algorithms. We propose a fine-grained search space comprised of atomic blocks, minimal search units that are much smaller than the ones used in recent NAS algorithms. This search space allows a mix of operations by composing different types of atomic blocks, while the search space in previous methods only allows homogeneous operations. Based on this search space, we propose a resource-aware architecture search framework which automatically assigns the computational resources (e.g., output channel numbers) for each operation by jointly considering the performance and the computational cost. In addition, to accelerate the search process, we propose a dynamic network shrinkage technique which prunes atomic blocks with negligible influence on outputs on the fly. Instead of a search-and-retrain two-stage paradigm, our method simultaneously searches and trains the target architecture. It achieves state-of-the-art performance under several FLOPs configurations on ImageNet with a small search cost. The entire codebase is open-sourced (see the Code link above).
Keyword: Neural Architecture Search, Image Classification

Residual Energy-Based Models for Text Generation
Author: Yuntian Deng, Anton Bakhtin, Myle Ott, Arthur Szlam
link: https://openreview.net/pdf?id=B1l4SgHKDH
Code: None
Abstract: Text generation is ubiquitous in many NLP tasks, from summarization, to dialogue and machine translation. The dominant parametric approach is based on locally normalized models which predict one word at a time. While these work remarkably well, they are plagued by exposure bias due to the greedy nature of the generation process. In this work, we investigate un-normalized energy-based models (EBMs) which operate not at the token but at the sequence level. In order to make training tractable, we first work in the residual of a pretrained locally normalized language model and second we train using noise contrastive estimation. Furthermore, since the EBM works at the sequence level, we can leverage pretrained bi-directional contextual representations, such as BERT and RoBERTa. Our experiments on two large language modeling datasets show that residual EBMs yield lower perplexity compared to locally normalized baselines. Moreover, generation via importance sampling is very efficient and of higher quality than the baseline models according to human evaluation.
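
A minimal sketch of the training signal, under the assumption that a residual energy network scores whole sequences: with the pretrained locally normalized LM used as the noise distribution, the NCE logit reduces to the negative residual energy, so training becomes binary classification between human text and LM samples. `energy_net` and the batches of token ids are hypothetical placeholders, not the authors' code.

    import torch
    import torch.nn.functional as F

    def residual_nce_loss(energy_net, real_seqs, lm_samples):
        """NCE-style loss for a residual EBM (sketch).

        The joint model is p(x) proportional to p_LM(x) * exp(-E(x)); with the LM itself
        as the noise distribution, the classification logit reduces to -E(x), so the
        energy network learns to assign low energy to human text and high energy to
        LM samples. `energy_net` maps a batch of sequences to one scalar per sequence.
        """
        pos_logits = -energy_net(real_seqs)
        neg_logits = -energy_net(lm_samples)
        return (F.binary_cross_entropy_with_logits(pos_logits, torch.ones_like(pos_logits))
                + F.binary_cross_entropy_with_logits(neg_logits, torch.zeros_like(neg_logits)))
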
Keyword: energy-based models, text generation

A closer look at the approximation capabilities of neural networks
Author: Kai Fong Ernest Chong
link: https://openreview.net/pdf?id=rkevSgrtPr
Code: None
Abstract: The universal approximation theorem, in one of its most general versions, says that if we consider only continuous activation functions σ, then a standard feedforward neural network with one hidden layer is able to approximate any continuous multivariate function f to any given approximation threshold ε, if and only if σ is non-polynomial. In this paper, we give a direct algebraic proof of the theorem. Furthermore we shall explicitly quantify the number of hidden units required for approximation. Specifically, if X in R^n is compact, then a neural network with n input units, m output units, and a single hidden layer with {n+d choose d} hidden units (independent of m and ε), can uniformly approximate any polynomial function f:X -> R^m whose total degree is at most d for each of its m coordinate functions. In the general case that f is any continuous function, we show there exists some N in O(ε^{-n}) (independent of m), such that N hidden units would suffice to approximate f. We also show that this uniform approximation property (UAP) still holds even under seemingly strong conditions imposed on the weights. We highlight several consequences: (i) For any δ > 0, the UAP still holds if we restrict all non-bias weights w in the last layer to satisfy |w| < δ. (ii) There exists some λ>0 (depending only on f and σ), such that the UAP still holds if we restrict all non-bias weights w in the first layer to satisfy |w|>λ. (iii) If the non-bias weights in the first layer are fixed and randomly chosen from a suitable range, then the UAP holds with probability 1.
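
As a quick worked example of the hidden-unit count (my own numbers, not from the paper): for input dimension n = 3 and maximal total degree d = 4, the bound gives C(n+d, d) = C(7, 4) = 35 hidden units, independent of the output dimension m and of the approximation threshold ε.

    from math import comb

    n, d = 3, 4                  # example input dimension and maximal total degree
    print(comb(n + d, d))        # 35 hidden units suffice for such polynomials
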
Keyword: deep learning, approximation, universal approximation theorem

Deep Audio Priors Emerge From Harmonic Convolutional Networks
Author: Zhoutong Zhang, Yunyun Wang, Chuang Gan, Jiajun Wu, Joshua B. Tenenbaum, Antonio Torralba, William T. Freeman
link: https://openreview.net/pdf?id=rygjHxrYDB
Code: http://dap.csail.mit.edu/
Abstract: Convolutional neural networks (CNNs) excel in image recognition and generation. Among many efforts to explain their effectiveness, experiments show that CNNs carry strong inductive biases that capture natural image priors. Do deep networks also have inductive biases for audio signals? In this paper, we empirically show that current network architectures for audio processing do not show strong evidence in capturing such priors. We propose Harmonic Convolution, an operation that helps deep networks distill priors in audio signals by explicitly utilizing the harmonic structure within. This is done by engineering the kernel to be supported by sets of harmonic series, instead of local neighborhoods for convolutional kernels. We show that networks using Harmonic Convolution can reliably model audio priors and achieve high performance in unsupervised audio restoration tasks. With Harmonic Convolution, they also achieve better generalization performance for sound source separation.
Keyword: Audio, Deep Prior

Expected Information Maximization: Using the I-Projection for Mixture Density Estimation
Author: Philipp Becker, Oleg Arenz, Gerhard Neumann
link: https://openreview.net/pdf?id=ByglLlHFDS
Code: https://github.com/pbecker93/ExpectedInformationMaximization
Abstract: Modelling highly multi-modal data is a challenging problem in machine learning. Most algorithms are based on maximizing the likelihood, which corresponds to the M(oment)-projection of the data distribution to the model distribution.
The M-projection forces the model to average over modes it cannot represent. In contrast, the I(nformation)-projection ignores such modes in the data and concentrates on the modes the model can represent. Such behavior is appealing whenever we deal with highly multi-modal data where modelling single modes correctly is more important than covering all the modes. Despite this advantage, the I-projection is rarely used in practice due to the lack of algorithms that can efficiently optimize it based on data. In this work, we present a new algorithm called Expected Information Maximization (EIM) for computing the I-projection solely based on samples for general latent variable models, where we focus on Gaussian mixture models and Gaussian mixtures of experts. Our approach applies a variational upper bound to the I-projection objective which decomposes the original objective into single objectives for each mixture component as well as for the coefficients, allowing for efficient optimization. Similar to GANs, our approach employs discriminators but uses a more stable optimization procedure, using a tight upper bound. We show that our algorithm is much more effective in computing the I-projection than recent GAN approaches and we illustrate the effectiveness of our approach for modelling multi-modal behavior on two pedestrian and traffic prediction datasets.
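
The mode-averaging versus mode-seeking contrast is easy to reproduce numerically. The sketch below (my own toy, not the EIM algorithm) fits a single Gaussian to a bimodal 1-D target by brute-force grid search under both projections: the M-projection (forward KL) lands between the modes, while the I-projection (reverse KL) locks onto one of them.

    import numpy as np

    xs = np.linspace(-6, 6, 1201)
    dx = xs[1] - xs[0]

    def normal(x, mu, sig):
        return np.exp(-0.5 * ((x - mu) / sig) ** 2) / (sig * np.sqrt(2 * np.pi))

    p = 0.5 * normal(xs, -2.0, 0.5) + 0.5 * normal(xs, 2.0, 0.5)   # bimodal target

    def kl(a, b):
        mask = a > 1e-12
        return float(np.sum(a[mask] * np.log(a[mask] / np.maximum(b[mask], 1e-12))) * dx)

    best_m, best_i = (np.inf, 0.0, 0.0), (np.inf, 0.0, 0.0)
    for mu in np.linspace(-3, 3, 61):
        for sig in np.linspace(0.2, 3.0, 57):
            q = normal(xs, mu, sig)
            m_loss, i_loss = kl(p, q), kl(q, p)      # M-projection vs. I-projection
            best_m = min(best_m, (m_loss, mu, sig))
            best_i = min(best_i, (i_loss, mu, sig))

    print("M-projection: mu=%.2f sigma=%.2f (averages the two modes)" % best_m[1:])
    print("I-projection: mu=%.2f sigma=%.2f (locks onto a single mode)" % best_i[1:])
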
Keyword: density estimation, information projection, mixture models, generative learning, multimodal modeling

A Meta-Transfer Objective for Learning to Disentangle Causal Mechanisms
Author: Yoshua Bengio, Tristan Deleu, Nasim Rahaman, Nan Rosemary Ke, Sebastien Lachapelle, Olexa Bilaniuk, Anirudh Goyal, Christopher Pal
link: https://openreview.net/pdf?id=ryxWIgBFPS
Code: https://github.com/ec6dde01667145e58de60f864e05a4/CausalOptimizationAnon
Abstract: We propose to use a meta-learning objective that maximizes the speed of transfer on a modified distribution to learn how to modularize acquired knowledge. In particular, we focus on how to factor a joint distribution into appropriate conditionals, consistent with the causal directions. We explain when this can work, using the assumption that the changes in distributions are localized (e.g. to one of the marginals, for example due to an intervention on one of the variables). We prove that under this assumption of localized changes in causal mechanisms, the correct causal graph will tend to have only a few of its parameters with non-zero gradient, i.e. that need to be adapted (those of the modified variables). We argue and observe experimentally that this leads to faster adaptation, and use this property to define a meta-learning surrogate score which, in addition to a continuous parametrization of graphs, would favour correct causal graphs. Finally, motivated by the AI agent point of view (e.g. of a robot discovering its environment autonomously), we consider how the same objective can discover the causal variables themselves, as a transformation of observed low-level variables with no causal meaning. Experiments in the two-variable case validate the proposed ideas and theoretical results.
Keyword: meta-learning, transfer learning, structure learning, modularity, causality

On the interaction between supervision and self-play in emergent communication
Author: Ryan Lowe*, Abhinav Gupta*, Jakob Foerster, Douwe Kiela, Joelle Pineau
link: https://openreview.net/pdf?id=rJxGLlBtwH
Code: https://github.com/backpropper/s2p
Abstract: A promising approach for teaching artificial agents to use natural language involves using human-in-the-loop training. However, recent work suggests that current machine learning methods are too data inefficient to be trained in this way from scratch. In this paper, we investigate the relationship between two categories of learning signals with the ultimate goal of improving sample efficiency: imitating human language data via supervised learning, and maximizing reward in a simulated multi-agent environment via self-play (as done in emergent communication), and introduce the term supervised self-play (S2P) for algorithms using both of these signals. We find that first training agents via supervised learning on human data followed by self-play outperforms the converse, suggesting that it is not beneficial to emerge languages from scratch. We then empirically investigate various S2P schedules that begin with supervised learning in two environments: a Lewis signaling game with symbolic inputs, and an image-based referential game with natural language descriptions. Lastly, we introduce population based approaches to S2P, which further improves the performance over single-agent methods.
Keyword: multi-agent communication, self-play, emergent languages

Dynamic Model Pruning with Feedback
Author: Tao Lin, Sebastian U. Stich, Luis Barba, Daniil Dmitriev, Martin Jaggi
link: https://openreview.net/pdf?id=SJem8lSFwB
Code: None
Abstract: Deep neural networks often have millions of parameters. This can hinder their deployment to low-end devices, not only due to high memory requirements but also because of increased latency at inference. We propose a novel model compression method that generates a sparse trained model without additional overhead: by allowing (i) dynamic allocation of the sparsity pattern and (ii) incorporating feedback signal to reactivate prematurely pruned weights we obtain a performant sparse model in one single training pass (retraining is not needed, but can further improve the performance). We evaluate the method on CIFAR-10 and ImageNet, and show that the obtained sparse models can reach the state-of-the-art performance of dense models and further that their performance surpasses all previously proposed pruning schemes (that come without feedback mechanisms).
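
The core idea, computing gradients on the pruned model but applying them to a dense copy so that pruned weights keep receiving a feedback signal and can be reactivated, fits in a few lines. Below is a NumPy sketch on a toy linear-regression problem; the sparsity level, learning rate, and data are illustrative only and this is not the authors' implementation.

    import numpy as np

    def topk_mask(w, sparsity):
        """Keep only the largest-magnitude (1 - sparsity) fraction of the weights."""
        k = max(int(round((1.0 - sparsity) * w.size)), 1)
        thresh = np.sort(np.abs(w).ravel())[::-1][k - 1]
        return (np.abs(w) >= thresh).astype(w.dtype)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(256, 20))
    w_true = rng.normal(size=20) * (rng.random(20) < 0.3)      # sparse ground truth
    y = X @ w_true
    w = 0.1 * rng.normal(size=20)                              # dense weight copy

    for step in range(500):
        mask = topk_mask(w, sparsity=0.7)          # dynamic: the mask is recomputed every step
        w_sparse = w * mask                        # forward pass uses the pruned model
        grad = X.T @ (X @ w_sparse - y) / len(X)   # gradient of the sparse model's loss...
        w -= 0.1 * grad                            # ...applied to the dense copy (feedback)

    print("training error of the sparse model:",
          float(np.mean((X @ (w * topk_mask(w, 0.7)) - y) ** 2)))
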
Keyword: network pruning, dynamic reparameterization, model compression

Latent Normalizing Flows for Many-to-Many Cross-Domain Mappings
Author: Shweta Mahajan, Iryna Gurevych, Stefan Roth
link: https://openreview.net/pdf?id=SJxE8erKDH
Code: https://drive.google.com/file/d/15orWOM0WowN4OevkHstx600TgNtAXGcp/view?usp=sharing
Abstract: Learned joint representations of images and text form the backbone of several important cross-domain tasks such as image captioning. Prior work mostly maps both domains into a common latent representation in a purely supervised fashion. This is rather restrictive, however, as the two domains follow distinct generative processes. Therefore, we propose a novel semi-supervised framework, which models shared information between domains and domain-specific information separately.
The information shared between the domains is aligned with an invertible neural network. Our model integrates normalizing flow-based priors for the domain-specific information, which allows us to learn diverse many-to-many mappings between the two domains. We demonstrate the effectiveness of our model on diverse tasks, including image captioning and text-to-image synthesis.
Keyword: None

Transferring Optimality Across Data Distributions via Homotopy Methods
Author: Matilde Gargiani, Andrea Zanelli, Quoc Tran Dinh, Moritz Diehl, Frank Hutter
link: https://openreview.net/pdf?id=S1gEIerYwH
Code: None
Abstract: Homotopy methods, also known as continuation methods, are a powerful mathematical tool to efficiently solve various problems in numerical analysis, including complex non-convex optimization problems where no or only little prior knowledge regarding the localization of the solutions is available.
In this work, we propose a novel homotopy-based numerical method that can be used to transfer knowledge regarding the localization of an optimum across different task distributions in deep learning applications. We validate the proposed methodology with empirical evaluations in regression and classification scenarios, showing that superior numerical performance can be achieved on popular deep learning benchmarks such as FashionMNIST and CIFAR-10, and we draw connections with the widely used fine-tuning heuristic. In addition, we give more insight into the properties of a general homotopy method when used in combination with stochastic gradient descent by conducting a local theoretical analysis in a simplified setting.
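
To make the continuation idea concrete, here is a NumPy sketch (my own toy setup, not the paper's method): a logistic-regression loss is gradually morphed from a source data distribution to a shifted target distribution by a parameter λ that moves from 0 to 1, warm-starting each subproblem at the previous solution.

    import numpy as np

    rng = np.random.default_rng(1)
    w_true = rng.normal(size=10)

    def make_task(shift):
        X = rng.normal(size=(512, 10)) + shift
        return X, (X @ w_true > 0).astype(float)

    def grad(w, X, y):                                  # logistic-regression gradient
        z = np.clip(X @ w, -30, 30)
        p = 1.0 / (1.0 + np.exp(-z))
        return X.T @ (p - y) / len(X)

    (Xs, ys), (Xt, yt) = make_task(0.0), make_task(1.0)   # source and shifted target task
    w = np.zeros(10)

    # Continuation: solve a sequence of problems interpolating from the source loss
    # to the target loss, warm-starting each one at the previous solution.
    for lam in np.linspace(0.0, 1.0, 11):
        for _ in range(300):
            w -= 0.1 * ((1 - lam) * grad(w, Xs, ys) + lam * grad(w, Xt, yt))

    print("target accuracy after continuation:",
          float((((Xt @ w) > 0).astype(float) == yt).mean()))
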
Keyword: deep learning, numerical optimization, transfer learning

Regularizing activations in neural networks via distribution matching with the Wasserstein metric
Author: Taejong Joo, Donggu Kang, Byunghoon Kim
link: https://openreview.net/pdf?id=rygwLgrYPB
Code: None
Abstract: Regularization and normalization have become indispensable components in training deep neural networks, resulting in faster training and improved generalization performance. We propose the projected error function regularization loss (PER) that encourages activations to follow the standard normal distribution. PER randomly projects activations onto one-dimensional space and computes the regularization loss in the projected space. PER is similar to the Pseudo-Huber loss in the projected space, thus taking advantage of both L^1 and L^2 regularization losses. Besides, PER can capture the interaction between hidden units via projection vectors drawn from a unit sphere. By doing so, PER minimizes the upper bound of the Wasserstein distance of order one between an empirical distribution of activations and the standard normal distribution. To the best of the authors’ knowledge, this is the first work to regularize activations via distribution matching in the probability distribution space. We evaluate the proposed method on the image classification task and the word-level language modeling task.
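
The sketch below conveys the random-projection idea with a sliced-Wasserstein-style penalty in PyTorch; it is not the exact PER loss (which uses a Pseudo-Huber-type projected error), and the number of projections is an arbitrary choice.

    import torch

    def sliced_normality_penalty(h, num_projections=32):
        """Sliced-Wasserstein-style activation regularizer (sketch, not the exact PER loss).

        Activations are projected onto random unit vectors; in every 1-D slice the
        empirical distribution is pulled toward a standard normal by matching sorted
        projected samples against sorted standard-normal reference samples.
        """
        n, d = h.shape
        v = torch.randn(d, num_projections, device=h.device)
        v = v / v.norm(dim=0, keepdim=True)            # random directions on the unit sphere
        proj = h @ v                                    # shape (n, num_projections)
        ref = torch.randn(n, num_projections, device=h.device)
        return (proj.sort(dim=0).values - ref.sort(dim=0).values).abs().mean()

    # Usage: task_loss = task_loss + lam * sliced_normality_penalty(hidden_activations)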

Keyword: regularization, Wasserstein metric, deep learning

Mutual Information Gradient Estimation for Representation Learning
Author: Liangjian Wen, Yiji Zhou, Lirong He, Mingyuan Zhou, Zenglin Xu
link: https://openreview.net/pdf?id=ByxaUgrFvH
Code: None
Abstract: Mutual information (MI) plays an important role in representation learning. However, MI is unfortunately intractable in continuous and high-dimensional settings. Recent advances establish tractable and scalable MI estimators to discover useful representations. However, most of the existing methods are not capable of providing an accurate estimation of MI with low variance when the MI is large. We argue that directly estimating the gradients of MI is more appealing for representation learning than estimating MI itself. To this end, we propose the Mutual Information Gradient Estimator (MIGE) for representation learning based on the score estimation of implicit distributions. MIGE exhibits a tight and smooth gradient estimation of MI in the high-dimensional and large-MI settings. We expand the applications of MIGE to both unsupervised learning of deep representations based on InfoMax and the Information Bottleneck method. Experimental results indicate significant performance improvements in learning useful representations.
Keyword: Mutual Information, Score Estimation, Representation Learning, Information Bottleneck

Lite Transformer with Long-Short Range Attention
Author: Zhanghao Wu*, Zhijian Liu*, Ji Lin, Yujun Lin, Song Han
link: https://openreview.net/pdf?id=ByeMPlHKPH
Code: None
Abstract: Transformer has become ubiquitous in natural language processing (e.g., machine translation, question answering); however, it requires an enormous amount of computation to achieve high performance, which makes it unsuitable for mobile applications since mobile phones are tightly constrained by hardware resources and battery. In this paper, we investigate the mobile setting (under 500M Mult-Adds) for NLP tasks to facilitate deployment on edge devices. We present Long-Short Range Attention (LSRA), where one group of heads specializes in local context modeling (by convolution) while another group captures long-distance relationships (by attention). Based on this primitive, we design Lite Transformer, which is tailored for mobile NLP applications. Our Lite Transformer demonstrates consistent improvement over the transformer on three well-established language tasks: machine translation, abstractive summarization, and language modeling. It outperforms the transformer on WMT’14 English-French by 1.2 BLEU under 500M Mult-Adds and 1.7 BLEU under 100M Mult-Adds, and reduces the computation of the transformer base model by 2.5x. Further, with general techniques, our Lite Transformer achieves 18.2x model size compression. For language modeling, our Lite Transformer also achieves 3.8 lower perplexity than the transformer at around 500M Mult-Adds. Without the costly architecture search that requires more than 250 GPU years, our Lite Transformer outperforms the AutoML-based Evolved Transformer by 0.5 BLEU under the mobile setting.
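
A minimal PyTorch sketch of the channel-split idea behind LSRA follows: half of the features go through self-attention for global context and half through a depthwise convolution for local context, after which the two branches are merged. Layer sizes, kernel width, and the output projection are illustrative guesses, not the paper's configuration.

    import torch
    import torch.nn as nn

    class LSRABlock(nn.Module):
        """Channel-split block in the spirit of LSRA: half of the features go through
        self-attention (long range), the other half through a depthwise convolution
        (local context). Sizes and kernel width are illustrative, not the paper's."""

        def __init__(self, dim=256, heads=4, kernel_size=7):
            super().__init__()
            assert dim % 2 == 0
            self.attn = nn.MultiheadAttention(dim // 2, heads, batch_first=True)
            self.conv = nn.Conv1d(dim // 2, dim // 2, kernel_size,
                                  padding=kernel_size // 2, groups=dim // 2)
            self.proj = nn.Linear(dim, dim)

        def forward(self, x):                                 # x: (batch, seq_len, dim)
            a, c = x.chunk(2, dim=-1)
            a, _ = self.attn(a, a, a)                         # global branch
            c = self.conv(c.transpose(1, 2)).transpose(1, 2)  # local branch
            return self.proj(torch.cat([a, c], dim=-1))

    print(LSRABlock()(torch.randn(2, 16, 256)).shape)         # torch.Size([2, 16, 256])
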
Keyword: efficient model, transformer

A Function Space View of Bounded Norm Infinite Width ReLU Nets: The Multivariate Case
Author: Greg Ongie, Rebecca Willett, Daniel Soudry, Nathan Srebro
link: https://openreview.net/pdf?id=H1lNPxHKDH
Code: None
Abstract: We give a tight characterization of the (vectorized Euclidean) norm of weights required to realize a function f: R^d → R as a single hidden-layer ReLU network with an unbounded number of units (infinite width), extending the univariate characterization of Savarese et al. (2019) to the multivariate case.
Keyword: inductive bias, regularization, infinite-width networks, ReLU networks

Adversarial Lipschitz Regularization
Author: Dávid Terjék
link: https://openreview.net/pdf?id=Bke_DertPB
Code: https://drive.google.com/open?id=11CVllq2OmppENKBQdqGIBz_BYTiiZVl_
Abstract: Generative adversarial networks (GANs) are one of the most popular approaches when it comes to training generative models, among which variants of Wasserstein GANs are considered superior to the standard GAN formulation in terms of learning stability and sample quality. However, Wasserstein GANs require the critic to be 1-Lipschitz, which is often enforced implicitly by penalizing the norm of its gradient, or by globally restricting its Lipschitz constant via weight normalization techniques. Training with a regularization term penalizing the violation of the Lipschitz constraint explicitly, instead of through the norm of the gradient, was found to be practically infeasible in most situations. Inspired by Virtual Adversarial Training, we propose a method called Adversarial Lipschitz Regularization, and show that using an explicit Lipschitz penalty is indeed viable and leads to competitive performance when applied to Wasserstein GANs, highlighting an important connection between Lipschitz regularization and adversarial training.
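
The explicit penalty can be sketched as follows in PyTorch: pick a perturbation direction that increases the critic gap (a single gradient step here, standing in for the paper's VAT-style search), rescale it to a fixed radius, and penalize any difference quotient that exceeds 1. The step sizes and the one-step search are my own simplifications.

    import torch

    def explicit_lipschitz_penalty(critic, x, eps=0.1):
        """Explicit Lipschitz penalty (sketch): penalize squared violations of the
        1-Lipschitz constraint along an adversarially chosen direction."""
        r = (1e-2 * torch.randn_like(x)).requires_grad_(True)
        gap = (critic(x + r) - critic(x)).abs().sum()
        g, = torch.autograd.grad(gap, r)
        g = g.detach()
        norm = g.flatten(1).norm(dim=1).view(-1, *([1] * (x.dim() - 1))) + 1e-12
        r_adv = eps * g / norm                       # adversarial direction, radius eps
        ratio = (critic(x + r_adv) - critic(x)).abs().view(x.size(0), -1).sum(dim=1) / eps
        return torch.clamp(ratio - 1.0, min=0.0).pow(2).mean()

    # Usage in a WGAN critic update:
    #   critic_loss = wgan_loss + lambda_alr * explicit_lipschitz_penalty(critic, real_batch)
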
Keyword: generative adversarial networks, wasserstein generative adversarial networks, lipschitz regularization, adversarial training

Compositional Language Continual Learning
Author: Yuanpeng Li, Liang Zhao, Kenneth Church, Mohamed Elhoseiny
link: https://openreview.net/pdf?id=rklnDgHtDS
Code: https://github.com/yli1/CLCL
Abstract: Motivated by humans’ ability to continually learn and gain knowledge over time, several research efforts have been pushing the limits of machines to constantly learn while alleviating catastrophic forgetting. Most of the existing methods have been focusing on continual learning of label prediction tasks, which have fixed input and output sizes. In this paper, we propose a new scenario of continual learning which handles sequence-to-sequence tasks common in language learning. We further propose an approach to use a label prediction continual learning algorithm for sequence-to-sequence continual learning by leveraging compositionality. Experimental results show that the proposed method has significant improvement over state-of-the-art methods. It enables knowledge transfer and prevents catastrophic forgetting, resulting in more than 85% accuracy up to 100 stages, compared with less than 50% accuracy for baselines on the instruction learning task. It also shows significant improvement on the machine translation task. This is the first work to combine continual learning and compositionality for language learning, and we hope this work will make machines more helpful in various tasks.
Keyword: Compositionality, Continual Learning, Lifelong Learning, Sequence to Sequence Modeling

End to End Trainable Active Contours via Differentiable Rendering
Author: Shir Gur, Tal Shaharabany, Lior Wolf
link: https://openreview.net/pdf?id=rkxawlHKDr
Code: None
Abstract: We present an image segmentation method that iteratively evolves a polygon. At each iteration, the vertices of the polygon are displaced based on the local value of a 2D shift map that is inferred from the input image via an encoder-decoder architecture. The main training loss that is used is the difference between the polygon shape and the ground truth segmentation mask. The network employs a neural renderer to create the polygon from its vertices, making the process fully differentiable. We demonstrate that our method outperforms the state of the art segmentation networks and deep active contour solutions in a variety of benchmarks, including medical imaging and aerial images.
Keyword: None

Provable Filter Pruning for Efficient Neural Networks
Author: Lucas Liebenwein, Cenk Baykal, Harry Lang, Dan Feldman, Daniela Rus
link: https://openreview.net/pdf?id=BJxkOlSYDH
Code: https://github.com/lucaslie/provable_pruning
Abstract: We present a provable, sampling-based approach for generating compact Convolutional Neural Networks (CNNs) by identifying and removing redundant filters from an over-parameterized network. Our algorithm uses a small batch of input data points to assign a saliency score to each filter and constructs an importance sampling distribution where filters that highly affect the output are sampled with correspondingly high probability.
In contrast to existing filter pruning approaches, our method is simultaneously data-informed, exhibits provable guarantees on the size and performance of the pruned network, and is widely applicable to varying network architectures and data sets. Our analytical bounds bridge the notions of compressibility and importance of network structures, which gives rise to a fully-automated procedure for identifying and preserving filters in layers that are essential to the network’s performance. Our experimental evaluations on popular architectures and data sets show that our algorithm consistently generates sparser and more efficient models than those constructed by existing filter pruning approaches.
Keyword: theory, compression, filter pruning, neural networks

Effect of Activation Functions on the Training of Overparametrized Neural Nets
Author: Abhishek Panigrahi, Abhishek Shetty, Navin Goyal
link: https://openreview.net/pdf?id=rkgfdeBYvH
Code: https://drive.google.com/file/d/1Erj761XggITFSlcdiJJ8fKAkoEALU4L8/view?usp=sharing
Abstract: It is well-known that overparametrized neural networks trained using gradient based methods quickly achieve small training error with appropriate hyperparameter settings. Recent papers have proved this statement theoretically for highly overparametrized networks under reasonable assumptions. These results either assume that the activation function is ReLU or they depend on the minimum eigenvalue of a certain Gram matrix. In the latter case, existing works only prove that this minimum eigenvalue is non-zero and do not provide quantitative bounds which require that this eigenvalue be large. Empirically, a number of alternative activation functions have been proposed which tend to perform better than ReLU at least in some settings but no clear understanding has emerged. This state of affairs underscores the importance of theoretically understanding the impact of activation functions on training. In the present paper, we provide theoretical results about the effect of activation function on the training of highly overparametrized 2-layer neural networks. A crucial property that governs the performance of an activation is whether or not it is smooth:
• For non-smooth activations such as ReLU, SELU, ELU, which are not smooth because there is a point where either the first order or second order derivative is discontinuous, all eigenvalues of the associated Gram matrix are large under minimal assumptions on the data.
• For smooth activations such as tanh, swish, polynomial, which have derivatives of all orders at all points, the situation is more complex: if the subspace spanned by the data has small dimension then the minimum eigenvalue of the Gram matrix can be small leading to slow training. But if the dimension is large and the data satisfies another mild condition, then the eigenvalues are large. If we allow deep networks, then the small data dimension is not a limitation provided that the depth is sufficient.
We discuss a number of extensions and applications of these results.
Keyword: activation functions, deep learning theory, neural networks

Lipschitz constant estimation of Neural Networks via sparse polynomial optimization
Author: Fabian Latorre, Paul Rolland, Volkan Cevher
link: https://openreview.net/pdf?id=rJe4_xSFDB
Code: https://drive.google.com/drive/folders/1bkj0H6Thgd9sjRloyq9NBP0uO0v704E9?usp=sharing
Abstract: We introduce LiPopt, a polynomial optimization framework for computing increasingly tighter upper bounds on the Lipschitz constant of neural networks. The underlying optimization problems boil down to either linear (LP) or semidefinite (SDP) programming. We show how to use the sparse connectivity of a network to significantly reduce the complexity of computation. This is especially useful for convolutional as well as pruned neural networks. We conduct experiments on networks with random weights as well as networks trained on MNIST, showing that in the particular case of the ℓ∞-Lipschitz constant, our approach yields superior estimates compared to other baselines available in the literature.
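
For context, a simple baseline that methods like LiPopt improve on is the trivial product of per-layer spectral norms (an upper bound on the ℓ2 Lipschitz constant of a feed-forward ReLU network). The NumPy snippet below computes only that naive bound on random weights; it is not LiPopt's LP/SDP relaxation, and the layer shapes are arbitrary.

    import numpy as np

    def naive_lipschitz_bound(weight_mats):
        """Trivial upper bound for a feed-forward ReLU net under the l2 norm: the product
        of the layers' spectral norms. LiPopt tightens such bounds by solving LP/SDP
        relaxations of a sparse polynomial program; this is only the starting point."""
        return float(np.prod([np.linalg.norm(W, 2) for W in weight_mats]))

    rng = np.random.default_rng(0)
    layers = [rng.normal(size=(64, 32)) / np.sqrt(32),
              rng.normal(size=(10, 64)) / np.sqrt(64)]
    print("naive spectral-norm product bound:", naive_lipschitz_bound(layers))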

Keyword: robust networks, Lipschitz constant, polynomial optimization

State Alignment-based Imitation Learning
Author: Fangchen Liu, Zhan Ling, Tongzhou Mu, Hao Su
link: https://openreview.net/pdf?id=rylrdxHFDr
Code: None
Abstract: Consider an imitation learning problem in which the imitator and the expert have different dynamics models. Most existing imitation learning methods fail in this setting because they focus on imitating actions. We propose a novel state alignment-based imitation learning method that trains the imitator to follow the state sequences in the expert demonstrations as closely as possible. The alignment of states comes from both local and global perspectives. We combine them into a reinforcement learning framework via a regularized policy update objective. We show the superiority of our method on standard imitation learning settings as well as the challenging settings in which the expert and the imitator have different dynamics models.
Keyword: Imitation learning, Reinforcement Learning

Learning to Group: A Bottom-Up Framework for 3D Part Discovery in Unseen Categories
Author: Tiange Luo, Kaichun Mo, Zhiao Huang, Jiarui Xu, Siyu Hu, Liwei Wang, Hao Su
link: https://openreview.net/pdf?id=rkl8dlHYvB
Code: None
Abstract: We address the problem of learning to discover 3D parts for objects in unseen categories. Learning the geometry prior of parts and transferring this prior to unseen categories poses fundamental challenges for data-driven shape segmentation approaches. Formulated as a contextual bandit problem, we propose a learning-based iterative grouping framework which learns a grouping policy to progressively merge small part proposals into bigger ones in a bottom-up fashion. At the core of our approach is to restrict the local context for extracting part-level features, which encourages generalizability to novel categories. On a recently proposed large-scale fine-grained 3D part dataset, PartNet, we demonstrate that our method can transfer knowledge of parts learned from 3 training categories to 21 unseen testing categories without seeing any annotated samples. Quantitative comparisons against four strong shape segmentation baselines show that we achieve state-of-the-art performance.
Keyword: Shape Segmentation, Zero-Shot Learning, Learning Representations

Discriminative Particle Filter Reinforcement Learning for Complex Partial Observations
Author: Xiao Ma, Peter Karkus, David Hsu, Wee Sun Lee, Nan Ye
link: https://openreview.net/pdf?id=HJl8_eHYvS
Code: None
Abstract: Deep reinforcement learning is successful in decision making for sophisticated games, such as Atari, Go, etc.
However, real-world decision making often requires reasoning with partial information extracted from complex visual observations. This paper presents Discriminative Particle Filter Reinforcement Learning (DPFRL), a new reinforcement learning framework for complex partial observations. DPFRL encodes a differentiable particle filter in the neural network policy for explicit reasoning with partial observations over time. The particle filter maintains a belief using learned discriminative update, which is trained end-to-end for decision making. We show that using the discriminative update instead of standard generative models results in significantly improved performance, especially for tasks with complex visual observations, because they circumvent the difficulty of modeling complex observations that are irrelevant to decision making.
In addition, to extract features from the particle belief, we propose a new type of belief feature based on the moment generating function. DPFRL outperforms state-of-the-art POMDP RL models in Flickering Atari Games, an existing POMDP RL benchmark, and in Natural Flickering Atari Games, a new, more challenging POMDP RL benchmark introduced in this paper. Further, DPFRL performs well for visual navigation with real-world data in the Habitat environment.
Keyword: Reinforcement Learning, Partial Observability, Differentiable Particle Filtering

Unrestricted Adversarial Examples via Semantic Manipulation
Author: Anand Bhattad, Min Jin Chong, Kaizhao Liang, Bo Li, D. A. Forsyth
link: https://openreview.net/pdf?id=Sye_OgHFwH
Code: https://www.dropbox.com/s/69zx437t1dgo41b/semantic_attack_code.zip?dl=0
Abstract: Machine learning models, especially deep neural networks (DNNs), have been shown to be vulnerable to adversarial examples, which are carefully crafted samples with small perturbations. Such adversarial perturbations are usually restricted by bounding their L_p norm such that they are imperceptible, and thus many current defenses can exploit this property to reduce their adversarial impact. In this paper, we instead introduce “unrestricted” perturbations that manipulate semantically meaningful image-based visual descriptors - color and texture - in order to generate effective and photorealistic adversarial examples. We show that these semantically aware perturbations are effective against JPEG compression, feature squeezing, and adversarially trained models. We also show that the proposed methods can effectively be applied to both image classification and image captioning tasks on complex datasets such as ImageNet and MSCOCO. In addition, we conduct comprehensive user studies to show that our generated semantic adversarial examples are photorealistic to humans despite large-magnitude perturbations compared to other attacks.
Keyword: Adversarial Examples, Semantic Manipulation, Image Colorization, Texture Transfer

Classification-Based Anomaly Detection for General Data
Author: Liron Bergman, Yedid Hoshen
link: https://openreview.net/pdf?id=H1lK_lBtvS
Code: None
Abstract: Anomaly detection, finding patterns that substantially deviate from those seen previously, is one of the fundamental problems of artificial intelligence. Recently, classification-based methods were shown to achieve superior results on this task. In this work, we present a unifying view and propose an open-set method, GOAD, to relax current generalization assumptions. Furthermore, we extend the applicability of transformation-based methods to non-image data using random affine transformations. Our method is shown to obtain state-of-the-art accuracy and is applicable to broad data types. The strong performance of our method is extensively validated on multiple datasets from different domains.
Keyword: anomaly detection

Scale-Equivariant Steerable Networks
Author: Ivan Sosnovik, Michał Szmaja, Arnold Smeulders
link: https://openreview.net/pdf?id=HJgpugrKPS
Code: None
Abstract: The effectiveness of Convolutional Neural Networks (CNNs) has been substantially attributed to their built-in property of translation equivariance. However, CNNs do not have embedded mechanisms to handle other types of transformations. In this work, we pay attention to scale changes, which regularly appear in various tasks due to the changing distances between the objects and the camera. First, we introduce the general theory for building scale-equivariant convolutional networks with steerable filters. We develop scale-convolution and generalize other common blocks to be scale-equivariant. We demonstrate the computational efficiency and numerical stability of the proposed method. We compare the proposed models to the previously developed methods for scale equivariance and local scale invariance. We demonstrate state-of-the-art results on the MNIST-scale dataset and on the STL-10 dataset in the supervised learning setting.
Keyword: Scale Equivariance, Steerable Filters

On Generalization Error Bounds of Noisy Gradient Methods for Non-Convex Learning
Author: Jian Li, Xuanyuan Luo, Mingda Qiao
link: https://openreview.net/pdf?id=SkxxtgHKPS
Code: None
Abstract: Generalization error (also known as the out-of-sample error) measures how well the hypothesis learned from training data generalizes to previously unseen data. Proving tight generalization error bounds is a central question in statistical learning theory. In this paper, we obtain generalization error bounds for learning general non-convex objectives, which has attracted significant attention in recent years. We develop a new framework, termed Bayes-Stability, for proving algorithm-dependent generalization error bounds. The new framework combines ideas from both the PAC-Bayesian theory and the notion of algorithmic stability. Applying the Bayes-Stability method, we obtain new data-dependent generalization bounds for stochastic gradient Langevin dynamics (SGLD) and several other noisy gradient methods (e.g., with momentum, mini-batch and acceleration, Entropy-SGD). Our result recovers (and is typically tighter than) a recent result in Mou et al. (2018) and improves upon the results in Pensia et al. (2018). Our experiments demonstrate that our data-dependent bounds can distinguish randomly labelled data from normal data, which provides an explanation of the intriguing phenomena observed in Zhang et al. (2017a). We also study the setting where the total loss is the sum of a bounded loss and an additional ℓ2 regularization term. We obtain new generalization bounds for the continuous Langevin dynamics in this setting by developing a new Log-Sobolev inequality for the parameter distribution at any time. Our new bounds are more desirable when the noise level of the process is not very small, and do not become vacuous even when T tends to infinity.
Keyword: learning theory, generalization, nonconvex learning, stochastic gradient descent, Langevin dynamics

Consistency Regularization for Generative Adversarial Networks
Author: Han Zhang, Zizhao Zhang, Augustus Odena, Honglak Lee
link: https://openreview.net/pdf?id=S1lxKlSKPH
Code: None
Abstract: Generative Adversarial Networks (GANs) are known to be difficult to train, despite considerable research effort. Several regularization techniques for stabilizing training have been proposed, but they introduce non-trivial computational overheads and interact poorly with existing techniques like spectral normalization. In this work, we propose a simple, effective training stabilizer based on the notion of consistency regularization—a popular technique in the semi-supervised learning literature. In particular, we augment data passing into the GAN discriminator and penalize the sensitivity of the discriminator to these augmentations. We conduct a series of experiments to demonstrate that consistency regularization works effectively with spectral normalization and various GAN architectures, loss functions and optimizer settings. Our method achieves the best FID scores for unconditional image generation compared to other regularization methods on CIFAR-10 and CelebA. Moreover, our consistency regularized GAN (CR-GAN) improves state-of-the-art FID scores for conditional generation from 14.73 to 11.48 on CIFAR-10 and from 8.73 to 6.66 on ImageNet-2012.
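
The regularizer itself is a one-liner; the PyTorch sketch below penalizes the squared gap between discriminator outputs on a real image and on an augmented view of it, with `augment` standing in for whatever semantics-preserving augmentation (e.g., random flip and shift) one chooses, and the weight set to an arbitrary value.

    import torch

    def consistency_regularization(discriminator, real_images, augment, lam=10.0):
        """Consistency term (sketch): penalize the squared gap between the discriminator's
        outputs on a real image and on an augmented view of the same image."""
        d_real = discriminator(real_images)
        d_aug = discriminator(augment(real_images))
        return lam * (d_real - d_aug).pow(2).mean()

    # Inside the discriminator update:
    #   d_loss = adversarial_loss + consistency_regularization(D, x_real, augment)
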
Keyword: Generative Adversarial Networks, Consistency Regularization, GAN

Differentiable learning of numerical rules in knowledge graphs
Author: Po-Wei Wang, Daria Stepanova, Csaba Domokos, J. Zico Kolter
link: https://openreview.net/pdf?id=rJleKgrKwS
Code: None
Abstract: Rules over a knowledge graph (KG) capture interpretable patterns in data and can be used for KG cleaning and completion. Inspired by the TensorLog differentiable logic framework, which compiles rule inference into a sequence of differentiable operations, recently a method called Neural LP has been proposed for learning the parameters as well as the structure of rules. However, it is limited with respect to the treatment of numerical features like age, weight or scientific measurements. We address this limitation by extending Neural LP to learn rules with numerical values, e.g., “People younger than 18 typically live with their parents”. We demonstrate how dynamic programming and cumulative sum operations can be exploited to ensure efficiency of such extension. Our novel approach allows us to extract more expressive rules with aggregates, which are of higher quality and yield more accurate predictions compared to rules learned by the state-of-the-art methods, as shown by our experiments on synthetic and real-world datasets.
Keyword: knowledge graphs, rule learning, differentiable neural logic

Learning to Move with Affordance Maps
Author: William Qi, Ravi Teja Mullapudi, Saurabh Gupta, Deva Ramanan
link: https://openreview.net/pdf?id=BJgMFxrYPB
Code: https://github.com/wqi/A2L
Abstract: The ability to autonomously explore and navigate a physical space is a fundamental requirement for virtually any mobile autonomous agent, from household robotic vacuums to autonomous vehicles. Traditional SLAM-based approaches for exploration and navigation largely focus on leveraging scene geometry, but fail to model dynamic objects (such as other agents) or semantic constraints (such as wet floors or doorways). Learning-based RL agents are an attractive alternative because they can incorporate both semantic and geometric information, but are notoriously sample inefficient, difficult to generalize to novel settings, and are difficult to interpret. In this paper, we combine the best of both worlds with a modular approach that {\em learns} a spatial representation of a scene that is trained to be effective when coupled with traditional geometric planners. Specifically, we design an agent that learns to predict a spatial affordance map that elucidates what parts of a scene are navigable through active self-supervised experience gathering. In contrast to most simulation environments that assume a static world, we evaluate our approach in the VizDoom simulator, using large-scale randomly-generated maps containing a variety of dynamic actors and hazards. We show that learned affordance maps can be used to augment traditional approaches for both exploration and navigation, providing significant improvements in performance.
Keyword: navigation, exploration

Neural tangent kernels, transportation mappings, and universal approximation
Author: Ziwei Ji, Matus Telgarsky, Ruicheng Xian
link: https://openreview.net/pdf?id=HklQYxBKwS
Code: None
Abstract: This paper establishes rates of universal approximation for the shallow neural tangent kernel (NTK): network weights are only allowed microscopic changes from random initialization, which entails that activations are mostly unchanged, and the network is nearly equivalent to its linearization. Concretely, the paper has two main contributions: a generic scheme to approximate functions with the NTK by sampling from transport mappings between the initial weights and their desired values, and the construction of transport mappings via Fourier transforms. Regarding the first contribution, the proof scheme provides another perspective on how the NTK regime arises from rescaling: redundancy in the weights due to resampling allows individual weights to be scaled down. Regarding the second contribution, the most notable transport mapping asserts that roughly 1/δ^(10d) nodes are sufficient to approximate continuous functions, where δ depends on the continuity properties of the target function. By contrast, nearly the same proof yields a bound of 1/δ^(2d) for shallow ReLU networks; this gap suggests a tantalizing direction for future work, separating shallow ReLU networks and their linearization.

Keyword: Neural Tangent Kernel, universal approximation, Barron, transport mapping

SCALOR: Scalable Object-Oriented Sequential Generative Models
Author: Jindong Jiang*, Sepehr Janghorbani*, Gerard De Melo, Sungjin Ahn
link: https://openreview.net/pdf?id=SJxrKgStDH
Code: None
Abstract: Scalability in terms of object density in a scene is a primary challenge in unsupervised sequential object-oriented representation learning. Most of the previous models have been shown to work only on scenes with a few objects. In this paper, we propose SCALOR, a probabilistic generative model for learning SCALable Object-oriented Representation of a video. With the proposed spatially-parallel attention and proposal-rejection mechanisms, SCALOR can deal with orders of magnitude larger numbers of objects compared to the previous state-of-the-art models. Additionally, we introduce a background module that allows SCALOR to model complex dynamic backgrounds as well as many foreground objects in the scene. We demonstrate that SCALOR can deal with crowded scenes containing up to a hundred objects while jointly modeling complex dynamic backgrounds. Importantly, SCALOR is the first unsupervised object representation model shown to work for natural scenes containing several tens of moving objects.
Keyword: None

Prediction Poisoning: Towards Defenses Against DNN Model Stealing Attacks
Author: Tribhuvanesh Orekondy, Bernt Schiele, Mario Fritz
link: https://openreview.net/pdf?id=SyevYxHtDB
Code: None
Abstract: High-performance Deep Neural Networks (DNNs) are increasingly deployed in many real-world applications e.g., cloud prediction APIs. Recent advances in model functionality stealing attacks via black-box access (i.e., inputs in, predictions out) threaten the business model of such applications, which require a lot of time, money, and effort to develop. Existing defenses take a passive role against stealing attacks, such as by truncating predicted information. We find such passive defenses ineffective against DNN stealing attacks. In this paper, we propose the first defense which actively perturbs predictions targeted at poisoning the training objective of the attacker. We find our defense effective across a wide range of challenging datasets and DNN model stealing attacks, and additionally outperforms existing defenses. Our defense is the first that can withstand highly accurate model stealing attacks for tens of thousands of queries, amplifying the attacker’s error rate up to a factor of 85× with minimal impact on the utility for benign users.
Keyword: model functionality stealing, adversarial machine learning

Domain Adaptive Multibranch Networks
Author: Róger Bermúdez-Chacón, Mathieu Salzmann, Pascal Fua
link: https://openreview.net/pdf?id=rJxycxHKDS
Code: None
Abstract: We tackle unsupervised domain adaptation by accounting for the fact that different domains may need to be processed differently to arrive at a common feature representation effective for recognition. To this end, we introduce a deep learning framework where each domain undergoes a different sequence of operations, allowing some, possibly more complex, domains to go through more computations than others.
This contrasts with state-of-the-art domain adaptation techniques that force all domains to be processed with the same series of operations, even when using multi-stream architectures whose parameters are not shared.
As evidenced by our experiments, the greater flexibility of our method translates to higher accuracy. Furthermore, it allows us to handle any number of domains simultaneously.
Keyword: Domain Adaptation, Computer Vision

DiffTaichi: Differentiable Programming for Physical Simulation
Author: Yuanming Hu, Luke Anderson, Tzu-Mao Li, Qi Sun, Nathan Carr, Jonathan Ragan-Kelley, Fredo Durand
link: https://openreview.net/pdf?id=B1eB5xSFvr
Code: https://github.com/yuanming-hu/difftaichi
Abstract: We present DiffTaichi, a new differentiable programming language tailored for building high-performance differentiable physical simulators. Based on an imperative programming language, DiffTaichi generates gradients of simulation steps using source code transformations that preserve arithmetic intensity and parallelism. A light-weight tape is used to record the whole simulation program structure and replay the gradient kernels in a reversed order, for end-to-end backpropagation.
We demonstrate the performance and productivity of our language in gradient-based learning and optimization tasks on 10 different physical simulators. For example, a differentiable elastic object simulator written in our language is 4.2x shorter than the hand-engineered CUDA version yet runs as fast, and is 188x faster than the TensorFlow implementation.
Using our differentiable programs, neural network controllers are typically optimized within only tens of iterations.
Keyword: Differentiable programming, robotics, optimal control, physical simulation, machine learning system

Pitfalls of In-Domain Uncertainty Estimation and Ensembling in Deep Learning
Author: Arsenii Ashukha, Alexander Lyzhov, Dmitry Molchanov, Dmitry Vetrov
link: https://openreview.net/pdf?id=BJxI5gHKDr
Code: https://github.com/bayesgroup/pytorch-ensembles
Abstract: Uncertainty estimation and ensembling methods go hand-in-hand. Uncertainty estimation is one of the main benchmarks for assessment of ensembling performance. At the same time, deep learning ensembles have provided state-of-the-art results in uncertainty estimation. In this work, we focus on in-domain uncertainty for image classification. We explore the standards for its quantification and point out pitfalls of existing metrics. Avoiding these pitfalls, we perform a broad study of different ensembling techniques. To provide more insight into this study, we introduce the deep ensemble equivalent score (DEE) and show that many sophisticated ensembling techniques are equivalent to an ensemble of only a few independently trained networks in terms of test performance.
Keyword: uncertainty, in-domain uncertainty, deep ensembles, ensemble learning, deep learning

Episodic Reinforcement Learning with Associative Memory
Author: Guangxiang Zhu*, Zichuan Lin*, Guangwen Yang, Chongjie Zhang
link: https://openreview.net/pdf?id=HkxjqxBYDB
Code: None
Abstract: Sample efficiency has been one of the major challenges for deep reinforcement learning. Non-parametric episodic control has been proposed to speed up parametric reinforcement learning by rapidly latching onto previously successful policies. However, previous work on episodic reinforcement learning neglects the relationship between states and only stores the experiences as unrelated items. To improve the sample efficiency of reinforcement learning, we propose a novel framework, called Episodic Reinforcement Learning with Associative Memory (ERLAM), which associates related experience trajectories to enable reasoning about effective strategies. We build a graph on top of states in memory based on state transitions and develop a reverse-trajectory propagation strategy to allow rapid value propagation through the graph. We use the non-parametric associative memory as early guidance for a parametric reinforcement learning model. Results on a navigation domain and Atari games show our framework achieves significantly higher sample efficiency than state-of-the-art episodic reinforcement learning models.
Keyword: Deep Reinforcement Learning, Episodic Control, Episodic Memory, Associative Memory, Non-Parametric Method, Sample Efficiency

Sub-policy Adaptation for Hierarchical Reinforcement Learning
Author: Alexander Li, Carlos Florensa, Ignasi Clavera, Pieter Abbeel
link: https://openreview.net/pdf?id=ByeWogStDS
Code: https://anonymous.4open.science/r/de105a6d-8f8b-405e-b90a-54ab74adcb17/
Abstract: Hierarchical reinforcement learning is a promising approach to tackle long-horizon decision-making problems with sparse rewards. Unfortunately, most methods still decouple the lower-level skill acquisition process and the training of a higher level that controls the skills in a new task. Leaving the skills fixed can lead to significant sub-optimality in the transfer setting. In this work, we propose a novel algorithm to discover a set of skills, and continuously adapt them along with the higher level even when training on a new task. Our main contributions are two-fold. First, we derive a new hierarchical policy gradient with an unbiased latent-dependent baseline, and we introduce Hierarchical Proximal Policy Optimization (HiPPO), an on-policy method to efficiently train all levels of the hierarchy jointly. Second, we propose a method of training time-abstractions that improves the robustness of the obtained skills to environment changes. Code and videos are available at sites.google.com/view/hippo-rl.
Keyword: Hierarchical Reinforcement Learning, Transfer, Skill Discovery

Critical initialisation in continuous approximations of binary neural networks
Author: George Stamatescu, Federica Gerace, Carlo Lucibello, Ian Fuss, Langford White
link: https://openreview.net/pdf?id=rylmoxrFDH
Code: None
Abstract: The training of stochastic neural network models with binary (±1) weights and activations via continuous surrogate networks is investigated. We derive new surrogates using a novel derivation based on writing the stochastic neural network as a Markov chain. This derivation also encompasses existing variants of the surrogates presented in the literature. Following this, we theoretically study the surrogates at initialisation. We derive, using mean field theory, a set of scalar equations describing how input signals propagate through the randomly initialised networks. The equations reveal whether so-called critical initialisations exist for each surrogate network, where the network can be trained to arbitrary depth. Moreover, we predict theoretically and confirm numerically, that common weight initialisation schemes used in standard continuous networks, when applied to the mean values of the stochastic binary weights, yield poor training performance. This study shows that, contrary to common intuition, the means of the stochastic binary weights should be initialised close to ±1 for deeper networks to be trainable.
Keyword: None

Deep Orientation Uncertainty Learning based on a Bingham Loss
Author: Igor Gilitschenski, Roshni Sahoo, Wilko Schwarting, Alexander Amini, Sertac Karaman, Daniela Rus
link: https://openreview.net/pdf?id=ryloogSKDS
Code: None
Abstract: Reasoning about uncertain orientations is one of the core problems in many perception tasks such as object pose estimation or motion estimation. In these scenarios, poor illumination conditions, sensor limitations, or appearance invariance may result in highly uncertain estimates. In this work, we propose a novel learning-based representation for orientation uncertainty. By characterizing uncertainty over unit quaternions with the Bingham distribution, we formulate a loss that naturally captures the antipodal symmetry of the representation. We discuss the interpretability of the learned distribution parameters and demonstrate the feasibility of our approach on several challenging real-world pose estimation tasks involving uncertain orientations.
Keyword: Orientation Estimation, Directional Statistics, Bingham Distribution

Co-Attentive Equivariant Neural Networks: Focusing Equivariance On Transformations Co-Occurring in Data
Author: David W. Romero, Mark Hoogendoorn
link: https://openreview.net/pdf?id=r1g6ogrtDr
Code: https://www.dropbox.com/sh/2gghao89strdotw/AAAYJ6XclnfeoS3AfN9Z-n5Wa?dl=0
Abstract: Equivariance is a nice property to have as it produces much more parameter efficient neural architectures and preserves the structure of the input through the feature mapping. Even though some combinations of transformations might never appear (e.g. an upright face with a horizontal nose), current equivariant architectures consider the set of all possible transformations in a transformation group when learning feature representations. Contrarily, the human visual system is able to attend to the set of relevant transformations occurring in the environment and utilizes this information to assist and improve object recognition. Based on this observation, we modify conventional equivariant feature mappings such that they are able to attend to the set of co-occurring transformations in data and generalize this notion to act on groups consisting of multiple symmetries. We show that our proposed co-attentive equivariant neural networks consistently outperform conventional rotation equivariant and rotation & reflection equivariant neural networks on rotated MNIST and CIFAR-10.
Keyword: Equivariant Neural Networks, Attention Mechanisms, Deep Learning

Mixed Precision DNNs: All you need is a good parametrization
Author: Stefan Uhlich, Lukas Mauch, Fabien Cardinaux, Kazuki Yoshiyama, Javier Alonso Garcia, Stephen Tiedemann, Thomas Kemp, Akira Nakamura
link: https://openreview.net/pdf?id=Hyx0slrFvH
Code: None
Abstract: Efficient deep neural network (DNN) inference on mobile or embedded devices typically involves quantization of the network parameters and activations. In particular, mixed precision networks achieve better performance than networks with a homogeneous bitwidth under the same size constraint. Since choosing the optimal bitwidths is not straightforward, training methods that can learn them are desirable. Differentiable quantization with straight-through gradients allows the quantizer's parameters to be learned using gradient methods. We show that a suitable parametrization of the quantizer is the key to achieving stable training and good final performance. Specifically, we propose to parametrize the quantizer with the step size and dynamic range; the bitwidth can then be inferred from them. Other parametrizations, which explicitly use the bitwidth, consistently perform worse. We confirm our findings with experiments on CIFAR-10 and ImageNet, obtaining mixed precision DNNs with learned quantization parameters that achieve state-of-the-art performance.
Keyword: Deep Neural Network Compression, Quantization, Straight through gradients
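To make the step-size/dynamic-range parametrization concrete, here is a minimal PyTorch-style sketch (the class names, the log-parametrization and the exact clipping scheme are assumptions, not the authors' implementation) of a uniform quantizer trained with straight-through gradients, with the bitwidth inferred from the two learned parameters.

```python
import torch

class STERound(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return torch.round(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: pass the gradient through the rounding op unchanged.
        return grad_output

class UniformQuantizer(torch.nn.Module):
    """Quantizer parametrized by learnable step size and dynamic range; the bitwidth is derived."""
    def __init__(self, init_step=0.05, init_range=1.0):
        super().__init__()
        self.log_step = torch.nn.Parameter(torch.tensor(float(init_step)).log())
        self.log_range = torch.nn.Parameter(torch.tensor(float(init_range)).log())

    def forward(self, x):
        step, d_range = self.log_step.exp(), self.log_range.exp()
        x = torch.minimum(torch.maximum(x, -d_range), d_range)   # clip to the dynamic range
        return STERound.apply(x / step) * step                   # quantize with STE gradients

    def bitwidth(self):
        # Number of bits needed to index all quantization levels inside [-range, +range].
        levels = 2 * torch.floor(self.log_range.exp() / self.log_step.exp()) + 1
        return int(torch.ceil(torch.log2(levels)).item())

q = UniformQuantizer()
w = torch.randn(64, requires_grad=True)
q(w).sum().backward()                 # gradients reach w as well as the step/range parameters
print("inferred bitwidth:", q.bitwidth())
```

Because the bitwidth is a derived quantity rather than a directly trained one, the two continuous parameters can move freely during optimization, which is the kind of stability argument the abstract makes for this parametrization.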

Information Geometry of Orthogonal Initializations and Training
Author: Piotr Aleksander Sokół, Il Memming Park
link: https://openreview.net/pdf?id=rkg1ngrFPr
Code: https://github.com/PiotrSokol/info-geom
Abstract: Recently, mean field theory has been successfully used to analyze properties of wide, random neural networks. It gave rise to a prescriptive theory for initializing feed-forward neural networks with orthogonal weights, which ensures that both the forward propagated activations and the backpropagated gradients are near ℓ2 isometries and, as a consequence, training is orders of magnitude faster. Despite strong empirical performance, the mechanisms by which critical initializations confer an advantage in the optimization of deep neural networks are poorly understood. Here we show a novel connection between the maximum curvature of the optimization landscape (gradient smoothness), as measured by the Fisher information matrix (FIM), and the spectral radius of the input-output Jacobian, which partially explains why more isometric networks can train much faster. Furthermore, given that orthogonal weights are necessary to ensure that gradient norms are approximately preserved at initialization, we experimentally investigate the benefits of maintaining orthogonality throughout training, and we conclude that manifold optimization of weights performs well regardless of the smoothness of the gradients. Moreover, we observe a surprising yet robust behavior of highly isometric initializations: even though such networks have a lower FIM condition number at initialization, and therefore by analogy to convex functions should be easier to optimize, experimentally they prove to be much harder to train with stochastic gradient descent. We conjecture that the FIM condition number plays a non-trivial role in the optimization.
Keyword: Fisher, mean-field, deep learning
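The connection between isometry and the input-output Jacobian is easy to probe numerically. The sketch below (an illustration with an assumed width, depth and tanh network, unrelated to the paper's code) compares the singular-value spread of the input-output Jacobian at orthogonal versus i.i.d. Gaussian initialization of the same scale; the orthogonal case typically shows a tighter, better-conditioned spectrum.

```python
import torch

def make_mlp(width=128, depth=20, orthogonal=True):
    layers = []
    for _ in range(depth):
        lin = torch.nn.Linear(width, width, bias=False)
        if orthogonal:
            torch.nn.init.orthogonal_(lin.weight)               # all weight singular values equal 1
        else:
            torch.nn.init.normal_(lin.weight, std=width ** -0.5)  # same scale, but spread-out spectrum
        layers += [lin, torch.nn.Tanh()]
    return torch.nn.Sequential(*layers)

def jacobian_singular_values(net, width=128):
    x = torch.randn(width)
    jac = torch.autograd.functional.jacobian(net, x)            # (width, width) input-output Jacobian
    return torch.linalg.svdvals(jac)

for orth in (True, False):
    sv = jacobian_singular_values(make_mlp(orthogonal=orth))
    print(f"orthogonal={orth}: max={sv.max().item():.4f}, min={sv.min().item():.4g}")
```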

Extreme Classification via Adversarial Softmax Approximation
Author: Robert Bamler, Stephan Mandt
link: https://openreview.net/pdf?id=rJxe3xSYDS
Code: https://github.com/mandt-lab/adversarial-negative-sampling
Abstract: Training a classifier over a large number of classes, known as ‘extreme classification’, has become a topic of major interest with applications in technology, science, and e-commerce. Traditional softmax regression induces a gradient cost proportional to the number of classes C, which often is prohibitively expensive. A popular scalable softmax approximation relies on uniform negative sampling, which suffers from slow convergence due to a poor signal-to-noise ratio. In this paper, we propose a simple training method for drastically enhancing the gradient signal by drawing negative samples from an adversarial model that mimics the data distribution. Our contributions are three-fold: (i) an adversarial sampling mechanism that produces negative samples at a cost only logarithmic in C, thus still resulting in cheap gradient updates; (ii) a mathematical proof that this adversarial sampling minimizes the gradient variance while any bias due to non-uniform sampling can be removed; (iii) experimental results on large scale data sets that show a reduction of the training time by an order of magnitude relative to several competitive baselines.
Keyword: Extreme classification, negative sampling

Learning Nearly Decomposable Value Functions Via Communication Minimization
Author: Tonghan Wang*, Jianhao Wang*, Chongyi Zheng, Chongjie Zhang
link: https://openreview.net/pdf?id=HJx-3grYDB
Code: https://gofile.io/?c=b9ipVV
Abstract: Reinforcement learning encounters major challenges in multi-agent settings, such as scalability and non-stationarity. Recently, value function factorization learning has emerged as a promising way to address these challenges in collaborative multi-agent systems. However, existing methods have focused on learning fully decentralized value functions, which are not efficient for tasks requiring communication. To address this limitation, this paper presents a novel framework for learning nearly decomposable Q-functions (NDQ) via communication minimization, with which agents act on their own most of the time but occasionally send messages to other agents for effective coordination. This framework hybridizes value function factorization learning and communication learning by introducing two information-theoretic regularizers. These regularizers maximize the mutual information between agents' action selection and communication messages while minimizing the entropy of messages between agents. We show how to optimize these regularizers in a way that is easily integrated with existing value function factorization methods such as QMIX. Finally, we demonstrate that, on the StarCraft unit micromanagement benchmark, our framework significantly outperforms baseline methods and allows us to cut off more than 80% of the communication without sacrificing performance. The videos of our experiments are available at
Keyword: Multi-agent reinforcement learning, Nearly decomposable value function, Minimized communication

Robust Subspace Recovery Layer for Unsupervised Anomaly Detection
Author: Chieh-Hsin Lai, Dongmian Zou, Gilad Lerman
link: https://openreview.net/pdf?id=rylb3eBtwr
Code: None
Abstract: We propose a neural network for unsupervised anomaly detection with a novel robust subspace recovery layer (RSR layer). This layer seeks to extract the underlying subspace from a latent representation of the given data and removes outliers that lie away from this subspace. It is used within an autoencoder. The encoder maps the data into a latent space, from which the RSR layer extracts the subspace. The decoder then smoothly maps back the underlying subspace to a "manifold" close to the original inliers. Inliers and outliers are distinguished according to the distances between the original and mapped positions (small for inliers and large for outliers). Extensive numerical experiments with both image and document datasets demonstrate state-of-the-art precision and recall.
Keyword: robust subspace recovery, unsupervised anomaly detection, outliers, latent space, autoencoder
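The abstract describes a concrete architectural pattern: an encoder, a linear subspace-recovery projection, and a decoder, trained so that inliers are well explained by the recovered subspace. Below is a schematic PyTorch sketch of that pattern (dimensions, penalty weights and the exact loss terms are assumptions rather than the authors' implementation).

```python
import torch

class RSRAutoencoder(torch.nn.Module):
    def __init__(self, in_dim=784, latent_dim=64, subspace_dim=10):
        super().__init__()
        self.encoder = torch.nn.Sequential(torch.nn.Linear(in_dim, 256), torch.nn.ReLU(),
                                           torch.nn.Linear(256, latent_dim))
        self.A = torch.nn.Parameter(torch.empty(subspace_dim, latent_dim))  # RSR projection
        torch.nn.init.orthogonal_(self.A)
        self.decoder = torch.nn.Sequential(torch.nn.Linear(subspace_dim, 256), torch.nn.ReLU(),
                                           torch.nn.Linear(256, in_dim))

    def forward(self, x):
        z = self.encoder(x)
        z_proj = z @ self.A.t()                    # RSR layer: project the latent code onto the subspace
        return self.decoder(z_proj), z, z_proj

def rsr_loss(model, x, x_hat, z, lam1=0.1, lam2=0.1):
    recon = (x - x_hat).norm(dim=1).mean()                        # robust (non-squared) reconstruction error
    back = z @ model.A.t() @ model.A                              # map the projected code back into latent space
    rsr_recon = (z - back).norm(dim=1).mean()                     # latent points should lie near the subspace
    ortho = (model.A @ model.A.t() - torch.eye(model.A.shape[0])).pow(2).sum()  # keep A near-orthogonal
    return recon + lam1 * rsr_recon + lam2 * ortho

model = RSRAutoencoder()
x = torch.randn(32, 784)
x_hat, z, z_proj = model(x)
rsr_loss(model, x, x_hat, z).backward()
```

At test time, the distance between an input and its reconstruction (or between a latent code and its projection) can serve as the anomaly score, in the spirit of the inlier/outlier separation described above.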

Learning to Coordinate Manipulation Skills via Skill Behavior Diversification
Author: Youngwoon Lee, Jingyun Yang, Joseph J. Lim
link: https://openreview.net/pdf?id=ryxB2lBtvH
Code: https://github.com/clvrai/coordination
Abstract: When mastering a complex manipulation task, humans often decompose the task into sub-skills of their body parts, practice the sub-skills independently, and then execute the sub-skills together. Similarly, a robot with multiple end-effectors can perform complex tasks by coordinating sub-skills of each end-effector. To realize temporal and behavioral coordination of skills, we propose a modular framework that first individually trains sub-skills of each end-effector with skill behavior diversification, and then learns to coordinate end-effectors using diverse behaviors of the skills. We demonstrate that our proposed framework is able to efficiently coordinate skills to solve challenging collaborative control tasks such as picking up a long bar, placing a block inside a container while pushing the container with two robot arms, and pushing a box with two ant agents. Videos and code are available at
Keyword: reinforcement learning, hierarchical reinforcement learning, modular framework, skill coordination, bimanual manipulation

NAS-Bench-1Shot1: Benchmarking and Dissecting One-shot Neural Architecture Search
Author: Arber Zela, Julien Siems, Frank Hutter
link: https://openreview.net/pdf?id=SJx9ngStPH
Code: https://github.com/automl/nasbench-1shot1
Abstract: One-shot neural architecture search (NAS) has played a crucial role in making NAS methods computationally feasible in practice. Nevertheless, there is still a lack of understanding of how these weight-sharing algorithms exactly work, due to the many factors controlling the dynamics of the process. In order to allow a scientific study of these components, we introduce a general framework for one-shot NAS that can be instantiated to many recently-introduced variants, and introduce a general benchmarking framework that draws on the recent large-scale tabular benchmark NAS-Bench-101 for cheap anytime evaluations of one-shot NAS methods. To showcase the framework, we compare several state-of-the-art one-shot NAS methods, examine how sensitive they are to their hyperparameters and how they can be improved by tuning their hyperparameters, and compare their performance to that of blackbox optimizers for NAS-Bench-101.
Keyword: Neural Architecture Search, Deep Learning, Computer Vision

Conservative Uncertainty Estimation By Fitting Prior Networks
Author: Kamil Ciosek, Vincent Fortuin, Ryota Tomioka, Katja Hofmann, Richard Turner
link: https://openreview.net/pdf?id=BJlahxHYDS
Code: None
Abstract: Obtaining high-quality uncertainty estimates is essential for many applications of deep neural networks. In this paper, we theoretically justify a scheme for estimating uncertainties, based on sampling from a prior distribution. Crucially, the uncertainty estimates are shown to be conservative in the sense that they never underestimate a posterior uncertainty obtained by a hypothetical Bayesian algorithm. We also show concentration, implying that the uncertainty estimates converge to zero as we get more data. Uncertainty estimates obtained from random priors can be adapted to any deep network architecture and trained using standard supervised learning pipelines. We provide experimental evaluation of random priors on calibration and out-of-distribution detection on typical computer vision tasks, demonstrating that they outperform deep ensembles in practice.
Keyword: uncertainty quantification, deep learning, Gaussian process, epistemic uncertainty, random network, prior, Bayesian inference

Understanding Generalization in Recurrent Neural Networks
Author: Zhuozhuo Tu, Fengxiang He, Dacheng Tao
link: https://openreview.net/pdf?id=rkgg6xBYDH
Code: None
Abstract: In this work, we develop the theory for analyzing the generalization performance of recurrent neural networks. We first present a new generalization bound for recurrent neural networks based on matrix 1-norm and Fisher-Rao norm. The definition of Fisher-Rao norm relies on a structural lemma about the gradient of RNNs. This new generalization bound assumes that the covariance matrix of the input data is positive definite, which might limit its use in practice. To address this issue, we propose to add random noise to the input data and prove a generalization bound for training with random noise, which is an extension of the former one. Compared with existing results, our generalization bounds have no explicit dependency on the size of networks. We also discover that Fisher-Rao norm for RNNs can be interpreted as a measure of gradient, and incorporating this gradient measure not only can tighten the bound, but allows us to build a relationship between generalization and trainability. Based on the bound, we theoretically analyze the effect of covariance of features on generalization of RNNs and discuss how weight decay and gradient clipping in the training can help improve generalization.
Keyword: generalization, recurrent neural networks, learning theory

The Shape of Data: Intrinsic Distance for Data Distributions
Author: Anton Tsitsulin, Marina Munkhoeva, Davide Mottin, Panagiotis Karras, Alex Bronstein, Ivan Oseledets, Emmanuel Mueller
link: https://openreview.net/pdf?id=HyebplHYwB
Code: https://github.com/xgfs/imd
Abstract: The ability to represent and compare machine learning models is crucial in order to quantify subtle model changes, evaluate generative models, and gather insights on neural network architectures. Existing techniques for comparing data distributions focus on global data properties such as mean and covariance; in that sense, they are extrinsic and uni-scale. We develop a first-of-its-kind intrinsic and multi-scale method for characterizing and comparing data manifolds, using a lower bound of the spectral variant of the Gromov-Wasserstein inter-manifold distance, which compares all data moments. In a thorough experimental study, we demonstrate that our method effectively discerns the structure of data manifolds even on unaligned data of different dimensionalities; moreover, we showcase its efficacy in evaluating the quality of generative models.
Keyword: Deep Learning, Generative Models, Nonlinear Dimensionality Reduction, Manifold Learning, Similarity and Distance Learning, Spectral Methods
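As a rough picture of what an intrinsic, multi-scale comparison can look like, here is a small-scale sketch (the kNN-graph construction, scale grid and weighting are assumptions, and the released implementation relies on spectral approximations to scale to large data): each dataset is summarized by the heat trace of its graph Laplacian spectrum, and two datasets are compared across scales.

```python
import numpy as np
from scipy.sparse.csgraph import laplacian
from sklearn.neighbors import kneighbors_graph

def heat_trace(x, ts, k=5):
    graph = kneighbors_graph(x, n_neighbors=k, mode='connectivity')
    graph = ((graph + graph.T) > 0).astype(float)          # symmetrize the kNN graph
    lap = laplacian(graph, normed=True).toarray()
    eigvals = np.linalg.eigvalsh(lap)
    return np.array([np.exp(-t * eigvals).sum() for t in ts])  # heat-trace descriptor per scale t

def intrinsic_distance(x1, x2, ts=np.logspace(-1, 1, 50)):
    h1, h2 = heat_trace(x1, ts), heat_trace(x2, ts)
    return np.max(np.exp(-2 * (ts + 1 / ts)) * np.abs(h1 - h2))  # worst weighted gap across scales

rng = np.random.default_rng(0)
a = rng.standard_normal((300, 2))
b = rng.standard_normal((300, 2))
theta = rng.uniform(0, 2 * np.pi, 300)
c = np.column_stack([np.cos(theta), np.sin(theta)])         # a circle: different intrinsic shape
print(intrinsic_distance(a, b))   # typically small: same underlying distribution
print(intrinsic_distance(a, c))   # typically larger: different intrinsic structure
```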

How to 0wn the NAS in Your Spare Time
Author: Sanghyun Hong, Michael Davinroy, Yiğitcan Kaya, Dana Dachman-Soled, Tudor Dumitraş
link: https://openreview.net/pdf?id=S1erpeBFPB
Code: https://github.com/Sanghyun-Hong/How-to-0wn-NAS-in-Your-Spare-Time
Abstract: New data processing pipelines and novel network architectures increasingly drive the success of deep learning. In consequence, the industry considers top-performing architectures as intellectual property and devotes considerable computational resources to discovering such architectures through neural architecture search (NAS). This provides an incentive for adversaries to steal these novel architectures; when used in the cloud, to provide Machine Learning as a Service (MLaaS), the adversaries also have an opportunity to reconstruct the architectures by exploiting a range of hardware side-channels. However, it is challenging to reconstruct novel architectures and pipelines without knowing the computational graph (e.g., the layers, branches or skip connections), the architectural parameters (e.g., the number of filters in a convolutional layer) or the specific pre-processing steps (e.g. embeddings). In this paper, we design an algorithm that reconstructs the key components of a novel deep learning system by exploiting a small amount of information leakage from a cache side-channel attack, Flush+Reload. We use Flush+Reload to infer the trace of computations and the timing for each computation. Our algorithm then generates candidate computational graphs from the trace and eliminates incompatible candidates through a parameter estimation process. We implement our algorithm in PyTorch and Tensorflow. We demonstrate experimentally that we can reconstruct MalConv, a novel data pre-processing pipeline for malware detection, and ProxylessNAS-CPU, a novel network architecture for the ImageNet classification optimized to run on CPUs, without knowing the architecture family. In both cases, we achieve 0% error. These results suggest hardware side channels are a practical attack vector against MLaaS, and more efforts should be devoted to understanding their impact on the security of deep learning systems.
Keyword: Reconstructing Novel Deep Learning Systems

Enabling Deep Spiking Neural Networks with Hybrid Conversion and Spike Timing Dependent Backpropagation
Author: Nitin Rathi, Gopalakrishnan Srinivasan, Priyadarshini Panda, Kaushik Roy
link: https://openreview.net/pdf?id=B1xSperKvH
Code: https://github.com/nitin-rathi/hybrid-snn-conversion
Abstract: Spiking Neural Networks (SNNs) operate with asynchronous discrete events (or spikes), which can potentially lead to higher energy-efficiency in neuromorphic hardware implementations. Many works have shown that an SNN for inference can be formed by copying the weights from a trained Artificial Neural Network (ANN) and setting the firing threshold for each layer as the maximum input received in that layer. Such converted SNNs require a large number of time steps to achieve competitive accuracy, which diminishes the energy savings. The number of time steps can be reduced by training SNNs with spike-based backpropagation from scratch, but that is computationally expensive and slow. To address these challenges, we present a computationally-efficient training technique for deep SNNs. We propose a hybrid training methodology: 1) take a converted SNN and use its weights and thresholds as an initialization step for spike-based backpropagation, and 2) perform incremental spike-timing dependent backpropagation (STDB) on this carefully initialized network to obtain an SNN that converges within a few epochs and requires fewer time steps for input processing. STDB is performed with a novel surrogate gradient function defined using the neuron's spike time. The weight update is proportional to the difference in spike timing between the current time step and the most recent time step at which the neuron generated an output spike. The SNNs trained with our hybrid conversion-and-STDB training require 10×-25× fewer time steps and achieve similar accuracy compared to purely converted SNNs. The proposed training methodology converges in fewer than 20 epochs of spike-based backpropagation for most standard image classification datasets, thereby greatly reducing the training complexity compared to training SNNs from scratch. We perform experiments on the CIFAR-10, CIFAR-100 and ImageNet datasets for both VGG and ResNet architectures. We achieve a top-1 accuracy of 65.19% on ImageNet with an SNN using 250 time steps, which is 10× faster than converted SNNs of similar accuracy.
Keyword: spiking neural networks, ann-snn conversion, spike-based backpropagation, imagenet
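The core trick, a surrogate gradient that depends on how recently a neuron spiked, can be sketched as a custom autograd function. The snippet below is a simplified, assumed form (the constants and the exponential decay are illustrative and not taken from the paper), showing where such a term plugs into spike-based backpropagation.

```python
import torch

ALPHA, BETA = 0.3, 0.01   # assumed surrogate-gradient constants

class SpikeSTDB(torch.autograd.Function):
    @staticmethod
    def forward(ctx, membrane, threshold, time_since_spike):
        ctx.save_for_backward(time_since_spike)
        return (membrane >= threshold).float()          # binary spike output

    @staticmethod
    def backward(ctx, grad_output):
        (time_since_spike,) = ctx.saved_tensors
        # The surrogate gradient decays with the time elapsed since the neuron last emitted a spike.
        surrogate = ALPHA * torch.exp(-BETA * time_since_spike)
        return grad_output * surrogate, None, None

membrane = torch.randn(8, requires_grad=True)           # membrane potentials at the current time step
time_since_spike = torch.randint(0, 50, (8,)).float()   # time steps since each neuron last spiked
spikes = SpikeSTDB.apply(membrane, 1.0, time_since_spike)
spikes.sum().backward()
print(membrane.grad)                                    # larger for recently active neurons
```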

Breaking Certified Defenses: Semantic Adversarial Examples with Spoofed Robustness Certificates
Author: Amin Ghiasi, Ali Shafahi, Tom Goldstein
link: https://openreview.net/pdf?id=HJxdTxHYvB
Code: None
Abstract: Defenses against adversarial attacks can be classified into certified and non-certified. Certifiable defenses make networks robust within a certain ℓp-bounded radius, so that it is impossible for the adversary to make adversarial examples within the certificate bound. We present an attack that maintains the imperceptibility property of adversarial examples while being outside of the certified radius. Furthermore, the proposed "Shadow Attack" can fool certifiably robust networks by producing an imperceptible adversarial example that gets misclassified and produces a strong "spoofed" certificate.
Keyword: None

Query-efficient Meta Attack to Deep Neural Networks
Author: Jiawei Du, Hu Zhang, Joey Tianyi Zhou, Yi Yang, Jiashi Feng
link: https://openreview.net/pdf?id=Skxd6gSYDS
Code: None
Abstract: Black-box attack methods aim to infer suitable attack patterns for targeted DNN models by using only the models' output feedback for the corresponding input queries. However, due to a lack of prior knowledge and inefficiency in leveraging the query and feedback information, existing methods are mostly query-intensive when searching for effective attack patterns. In this work, we propose a meta attack approach that is capable of attacking a targeted model with far fewer queries. Its high query-efficiency stems from effectively utilizing meta learning to learn a generalizable prior abstraction from previously observed attack patterns, and exploiting this prior to help infer attack patterns from only a few queries and outputs. Extensive experiments on MNIST, CIFAR10 and tiny-Imagenet demonstrate that our meta-attack method can remarkably reduce the number of model queries without sacrificing attack performance. Moreover, the obtained meta attacker is not restricted to a particular model and can adapt quickly to attack a variety of models. Our code will be released to the public.
Keyword: Adversarial attack, Meta learning

Massively Multilingual Sparse Word Representations
Author: Gábor Berend
link: https://openreview.net/pdf?id=HyeYTgrFPB
Code: https://github.com/begab/mamus
Abstract: In this paper, we introduce Mamus for constructing multilingual sparse word representations. Our algorithm operates by determining a shared set of semantic units which get reutilized across languages, giving it a competitive edge both in terms of speed and evaluation performance. We demonstrate that our proposed algorithm behaves competitively with strong baselines through a series of rigorous experiments on downstream applications spanning dependency parsing, document classification and natural language inference. Additionally, our experiments relying on the QVEC-CCA evaluation score suggest that the proposed sparse word representations convey increased interpretability compared to alternative approaches. Finally, we are releasing our multilingual sparse word representations for the typologically diverse set of 27 languages on which we conducted our experiments.
Keyword: sparse word representations, multilinguality, sparse coding
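The "shared set of semantic units" can be pictured as a dictionary that all languages are sparse-coded against. The toy sketch below (random stand-in embeddings, an off-the-shelf solver and arbitrary dictionary sizes; not the Mamus algorithm itself) shows the basic mechanics of reusing one dictionary across languages.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning, sparse_encode

rng = np.random.default_rng(0)
en_vectors = rng.standard_normal((500, 50))      # stand-ins for (already aligned) word embeddings
de_vectors = rng.standard_normal((400, 50))

# Learn the shared semantic units (dictionary atoms) on one language...
dict_learner = DictionaryLearning(n_components=64, alpha=0.5, max_iter=20,
                                  transform_algorithm='lasso_lars', random_state=0)
en_sparse = dict_learner.fit_transform(en_vectors)

# ...then reuse the same dictionary to sparse-code every other language.
de_sparse = sparse_encode(de_vectors, dict_learner.components_, algorithm='lasso_lars', alpha=0.5)
print(en_sparse.shape, de_sparse.shape)          # sparse codes over the same 64 shared units
```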

Monotonic Multihead Attention
Author: Xutai Ma, Juan Miguel Pino, James Cross, Liezl Puzon, Jiatao Gu
link: https://openreview.net/pdf?id=Hyg96gBKPS
Code: None
Abstract: Simultaneous machine translation models start generating a target sequence before they have encoded or read the source sequence. Recent approaches for this task either apply a fixed policy on top of a Transformer, or learnable monotonic attention on a weaker recurrent neural network based architecture. In this paper, we propose a new attention mechanism, Monotonic Multihead Attention (MMA), which introduces monotonic attention into multihead attention. We also introduce two novel interpretable approaches for latency control that are specifically designed for multiple attention heads. We apply MMA to the simultaneous machine translation task and demonstrate better latency-quality tradeoffs than MILk, the previous state-of-the-art approach.
Keyword: Simultaneous Translation, Transformer, Monotonic Attention

Gradients as Features for Deep Representation Learning
Author: Fangzhou Mu, Yingyu Liang, Yin Li
link: https://openreview.net/pdf?id=BkeoaeHKDS
Code: None
Abstract: We address the challenging problem of deep representation learning – the efficient adaptation of a pre-trained deep network to different tasks. Specifically, we propose to explore gradient-based features. These features are gradients of the model parameters with respect to a task-specific loss given an input sample. Our key innovation is the design of a linear model that incorporates both the gradients and the activations of the pre-trained network. We demonstrate that our model provides a local linear approximation to an underlying deep model, and discuss important theoretical insights. Moreover, we present an efficient algorithm for the training and inference of our model without computing the actual gradients. Our method is evaluated across a number of representation-learning tasks on several datasets and with different network architectures. Strong results are obtained in all settings, and are well-aligned with our theoretical insights.
Keyword: representation learning, gradient features, deep learning
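The basic construction, concatenating a sample's activations with the gradient that sample induces on a small head, can be written down in a few lines. The sketch below is an explicit-gradient illustration under assumptions (the choice of layer, the pseudo-label loss and the read-out classifier are all illustrative); the paper's contribution additionally includes avoiding this explicit gradient computation.

```python
import torch

backbone = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10))
for p in backbone.parameters():      # pretend the backbone is pre-trained and keep it frozen
    p.requires_grad_(False)
head = torch.nn.Linear(64, 10)       # small task head whose parameter-gradients serve as features

def gradient_features(x):
    feats = []
    for xi in x:                                               # per-sample gradients
        acts = backbone[:-1](xi.unsqueeze(0))                  # penultimate activations
        logits = head(acts)
        # Stand-in loss: cross-entropy against the network's own prediction as a pseudo-label.
        loss = torch.nn.functional.cross_entropy(logits, logits.argmax(dim=1))
        grads = torch.autograd.grad(loss, head.parameters(), create_graph=False)
        feats.append(torch.cat([acts.detach().flatten()] + [g.flatten() for g in grads]))
    return torch.stack(feats)

x = torch.randn(16, 32)
features = gradient_features(x)       # one row of activation + gradient features per sample
linear_model = torch.nn.Linear(features.shape[1], 10)
print(linear_model(features).shape)   # the linear model is trained on these fixed features
```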

Pay Attention to Features, Transfer Learn Faster CNNs
Author: Kafeng Wang, Xitong Gao, Yiren Zhao, Xingjian Li, Dejing Dou, Cheng-Zhong Xu
link: https://openreview.net/pdf?id=ryxyCeHtPB
Code: None
Abstract: Deep convolutional neural networks are now widely deployed in vision applications, but limited training data can restrict their task performance. Transfer learning offers CNNs the chance to learn with limited data samples by transferring knowledge from models pretrained on large datasets. Blindly transferring all learned features from the source dataset, however, brings unnecessary computation to CNNs on the target task. In this paper, we propose attentive feature distillation and selection (AFDS), which not only adjusts the strength of transfer learning regularization but also dynamically determines the important features to transfer. By deploying AFDS on ResNet-101, we achieve a state-of-the-art computation reduction at the same accuracy budget, outperforming all existing transfer learning methods. With a 10x MACs reduction budget, a ResNet-101 equipped with AFDS, transfer-learned from ImageNet to Stanford Dogs 120, can achieve an accuracy 11.07% higher than its best competitor.
Keyword: transfer learning, pruning, faster CNNs
