Highlights | NIPS 2015 Reasoning, Attention, Memory Workshop

Today's share covers the accepted papers from one of the NIPS 2015 workshops, Reasoning, Attention, Memory (RAM). There are a few things to watch in this workshop: first, it invited some non-DL researchers, e.g. from cognitive science, who brought memory research from a biological angle; second, there is quite a bit of ground-breaking work, such as releasing a new dataset or laying out a vision for where a field is heading. Overall, the models and theory in this workshop are not complicated and the papers are relatively simple, so my share will mix in quite a few of my own views.

Today's share includes:

How to learn an algorithm Juergen Schmidhuber, IDSIA.

From Attention to Memory and towards Longer-Term Dependencies Yoshua Bengio, University of Montreal.

Generating Images from Captions with Attention Elman Mansimov, Emilio Parisotto, Jimmy Lei Ba, Ruslan Salakhutdinov (University of Toronto).

Smooth Operators: the Rise of Differentiable Attention in Deep Learning Alex Graves, Google Deepmind.

Sleep, learning and memory: optimal inference in the prefrontal cortex Adrien Peyrache, New York University.

Dynamic Memory Networks for Natural Language Processing Ankit Kumar, Ozan Irsoy, Peter Ondruska, et al. (MetaMind).

Chess Q&A: Question Answering on Chess Games Volkan Cirik, Louis-Philippe Morency, Eduard Hovy (CMU).

Towards Neural Network-based Reasoning Baolin Peng, Zhengdong Lu, Hang Li, et al. (Noah's Ark Lab).

Neural Machine Translation: Progress Report and Beyond Kyunghyun Cho, New York University.

A Roadmap towards Machine Intelligence Tomas Mikolov, Facebook AI Research.

 How to learn an algorithm


The first talk has no slides and no paper. However, Prof. Juergen Schmidhuber gave a talk with the same title this October; the video was once shared on Baidu Cloud by 爱可可老师. If you don't feel like watching the video, you can read his 2014 survey, "Deep Learning in Neural Networks: An Overview".

The video is mostly an overview of DL's past, present and future, with no detailed introduction to any particular model or algorithm. It feels rather like a TED-style talk on how the technology developed; watch it if you are interested. Here are someone else's notes:

Juergen gave a brief history of RNNs. RNNs are general-purpose computers, and learning a program boils down to learning a weight matrix. An LSTM is an instance of an RNN, good for learning long-term relationships. Juergen then gave a short history of LSTMs, starting with supervised learning but also covering RL (e.g., Bakker et al., IROS 2003). LSTMs are now used for all kinds of things (speech recognition, translation, ...). Then he made this incredibly funny comment that

Google already existed 13 years ago, but it was not yet aware that it was becoming a big LSTM.

Juergen’s first deep network back in 1991 was a stack of RNNs (the Neural History Compressor). Deep learning was started by Ivakhnenko in 1965, backpropagation by Linnainmaa in 1970, and Werbos was the first one to apply this idea to NNs back in 1982.

Unsupervised learning is basically nothing but compression.

In the context of life-long meta learning, Juergen mentioned the Goedel machine. The Goedel machine (2003) is a theoretically optimal self-improver, which kind of ticks all the boxes for life-long learning. He gave the towers-of-Hanoi problem as an example: first learn context-free and context-sensitive languages, and finally learn to solve the puzzle itself.

In 2013, he used RNNs for learning from raw videos “RL in partially observable worlds” using compressed network search.

The talk was more an interesting history lesson than a discussion of current research activities, but nevertheless very interesting.

Future directions: learn subgoals in a hierarchical fashion, which makes learning faster.



  From Attention to Memory and towards Longer-Term Dependencies

I couldn't find corresponding slides or a paper for this one either (there are simply too many related papers), so below are someone else's notes together with my own understanding.

Long-Term Dependencies

One of the reasons attention mechanisms help is connected to the problem of long-term dependencies (that is, when learning a composition of many nonlinear functions by measuring a loss which depends on all of the nonlinearities, the derivatives can become very small or very large depending on the eigenvalues of the Jacobians). If you want a recurrent network to store information reliably, you need some kind of attractors in the dynamics (Jacobians with eigenvalues less than 1). The problem is that if the dynamics are contractive, you also get vanishing gradients. So the condition that lets an RNN store information reliably seems to imply that you must be forgetting things. One path to improving this was the LSTM, which introduces loops in the state-to-state transitions whose derivatives are only slightly less than one. Alternative early approaches are skip connections or a hierarchy of timescales. (小S's note: skip connections are also called shortcut connections; the Highway Networks I have recommended many times are built from shortcut connections plus gates, and the recently celebrated MSRA work that took 1st place at ILSVRC, Deep Residual Networks, can be seen as a special case of Highway Networks, so it too makes full use of shortcut connections. As for a hierarchy of timescales, an example was shared a few days ago: Facebook's Laplacian pyramid GAN work.)
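To make the Jacobian argument concrete, here is the standard way this point is usually written down (a generic statement of the vanishing/exploding-gradient problem, not a formula taken from the talk itself): with hidden states h_t, the gradient of a loss at time T with respect to an earlier state involves a product of Jacobians, and its norm is controlled by their largest singular values.

```latex
\frac{\partial \mathcal{L}_T}{\partial h_k}
  = \frac{\partial \mathcal{L}_T}{\partial h_T}
    \prod_{t=k+1}^{T} \frac{\partial h_t}{\partial h_{t-1}},
\qquad
\Bigl\| \prod_{t=k+1}^{T} \frac{\partial h_t}{\partial h_{t-1}} \Bigr\|
  \;\le\; \prod_{t=k+1}^{T} \sigma_{\max}\!\Bigl(\frac{\partial h_t}{\partial h_{t-1}}\Bigr)
```

If every Jacobian is contractive (largest singular value below 1, the reliable-storage regime above), the product shrinks exponentially in T - k and gradients vanish; if it is above 1, they can explode. The LSTM's near-identity state-to-state path keeps this factor close to one.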

Consider a memory that can be read from and written to: for many of the locations in the memory, very little happens at each time step, because a softmax is used which usually selects only a few elements. This is similar to how the memory cell in an LSTM is copied across time. For this to work, however, the memory needs to be large enough so that the network does not have to read from and write to the same locations very often. The idea of a "copy" (preserving information) can be generalized a little to any operation whose Jacobian has eigenvalues equal to 1. All of this suggests that networks with large memories are valuable, but large memories also make the networks more expensive.
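A minimal numpy sketch of the softmax-style read these notes describe (my own illustration, not code from the talk; the temperature parameter is an assumption used to make the point): the sharper the softmax, the fewer memory slots are actually touched, and untouched slots are effectively copied across time.

```python
import numpy as np

def soft_read(memory, query, temperature=0.1):
    """Read from a (slots x width) memory via a softmax over slot scores.

    A low temperature gives a peaked softmax, so most slots receive
    (almost) zero read weight and are left untouched -- the "copy"
    behaviour discussed above.
    """
    scores = memory @ query / temperature        # one score per slot
    weights = np.exp(scores - scores.max())      # numerically stable softmax
    weights /= weights.sum()
    return weights @ memory, weights             # weighted sum of slots

memory = np.random.randn(64, 16)                 # 64 slots of width 16
query = np.random.randn(16)
read_vector, read_weights = soft_read(memory, query)
print(read_vector.shape, read_weights.argmax())
```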


  Generating Images from Captions with Attention

This paper combines and extends two generative NN models: DRAW's differentiable attention mechanism, and a deterministic Laplacian pyramid adversarial network (GAN) used for post-processing. On top of that, another novelty of the work is that it reverses the image -> caption (text) generation direction and instead generates an image from a caption.

Expanding on these three points one by one:

1) The modification to DRAW in this work is to make generation conditional, turning DRAW into a conditional generative model; see p(x|y, Z_{1:T}) at the top right of Figure 2. This conditional change helps performance considerably (a generic form of the training criterion is sketched after this list).


2) The GAN is essentially unchanged; it is simply combined with DRAW, replacing the inference part, and is used to sharpen the generated images (scenes) so they look crisper. Note that this inference part is really just the reconstruction loss; my personal feeling is that this is needed because the generative model itself is often not powerful enough, or the training set is too small to support it (for more on the power of generative models, see the Smooth Operators discussion below). Judging from the generated samples so far, though, this step cannot change the semantics of an image; it feels like pure "sharpening"...

3) The caption (text) -> image generation direction is arguably the harder one. Here the attention mechanism acts as a hint: when attention falls on words in the caption that carry strong semantics, a very clear corresponding object appears in the image; but if attention "fails", that object does not show up in the image at all. In their experimental analysis the authors also swap out various image elements to reveal the advantage of this conditional align/attention model. For instance, by simply replacing the correctly attended words, the model can generate image compositions that never appear in the training set and that defy everyday experience.
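Going back to point 1): the training criterion of this kind of conditional generative model is, in generic form (my paraphrase, not the paper's exact equation), a conditional variational lower bound in which the caption y conditions both the prior over the latent variables and the reconstruction term.

```latex
\log p(x \mid y) \;\ge\;
\mathbb{E}_{q(Z_{1:T} \mid x,\, y)}\!\bigl[ \log p(x \mid y, Z_{1:T}) \bigr]
\;-\; D_{\mathrm{KL}}\!\bigl( q(Z_{1:T} \mid x, y) \,\|\, p(Z_{1:T} \mid y) \bigr)
```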


Of course, as mentioned above, caption -> image generation remains difficult. Even though "Deep Learning" supposedly can already tell cats from dogs, generating an accurate dog or cat from a caption in such a multi-modal setting still looks very hard. In 小S's view, besides the inherent complexity of images, the complexity of the current multi-modal framework is also a factor: even with one part differentiable and the other deterministic, the whole thing is still very complex...

That said, overall the paper is written very clearly; even readers who have never looked at attention models or image caption generation papers can use it as an entry point to understand the framework.



 Smooth Operators: the Rise of Differentiable Attention in Deep Learning

This talk is by Alex Graves of Google DeepMind; there are no slides under this title online. But Graves is one of the authors of DRAW, and one of DRAW's big contributions is that it was the first work to successfully learn a fully differentiable attention policy for sequence generation. So I will first recap the DRAW work, then discuss related work on differentiable attention, and finally go through the notes I found on this talk.


The DRAW work first appeared at ICML 2015; the arXiv paper is "DRAW: A Recurrent Neural Network For Image Generation". Open source, with code!


The paper opens with just three short lines explaining the motivation:

A person asked to draw, paint or otherwise recreate a visual scene will naturally do so in a sequential, iterative fashion, reassessing their handiwork after each modification. Rough outlines are gradually replaced by precise forms, lines are sharpened, darkened or erased, shapes are altered, and the final picture emerges.

In other words: we humans don't paint a picture in one go! We keep correcting it, refining the details pass after pass, and so on. So why should we ask a NN to generate an entire image for us in a single step? That! Makes! No! Sense! (Ahem, wrong register, let me be rigorous again.) So the authors set out to mimic this pass-after-pass process: at each step we only paint a tiny bit, focusing on just one part. At this point you might say: that's nothing new, isn't that exactly what the attention mechanism means? But the usual sequential attention model decides where to look during decoding, at the end of the pipeline, rather than inside this repeated painting process:

The main challenge faced by sequential attention models is learning where to look, which can be addressed with reinforcement learning techniques such as policy gradients (Mnih et al., 2014). The attention model in DRAW, however, is fully differentiable, making it possible to train with standard backpropagation. In this sense it resembles the selective read and write operations developed for the Neural Turing Machine.

So they proposed DRAW, the Deep Recurrent Attentive Writer model: "recurrent" means revising and refining pass after pass, and "attentive" means focusing on only one part at a time. The paper is not only richly illustrated (with experiments on several image datasets such as MNIST), it also comes with a video (see the paper for the link), where you can watch dynamically how DRAW "writes" out the digits one by one.


Here is my own understanding of differentiable attention. The DRAW work is mainly built on the Variational AutoEncoder (VAE) framework, doing inference for a deep generative model. Inference for deep generative models has been developing along many theoretical and modelling lines, and right now the VAE framework is advancing the fastest (I will summarize it another time). Deep generative models are used not only in image-text models but also in speech-text models; work on deep generative models for speech-to-text even appeared back in 1998. At that time, however, inference, computation, data noise and so on all limited progress. Today computation is no longer the bottleneck and inference methods are flourishing, hence the revival of deep generative models. Personally I see two main threads of work here: one keeps improving and optimizing the inference algorithms; the other combines generative and discriminative models so they complement each other. The first needs no elaboration; the second actually involves two kinds of attention mechanisms. The first kind is the soft attention we know best, e.g. the one in "Neural Machine Translation by Jointly Learning to Align and Translate". Soft attention can be differentiated directly in backpropagation, but its drawback is that it has to step through every input location (every word for text, every pixel for an image), which is computationally very expensive. In contrast, the second kind, hard attention, is stochastic and sampling-based, so it only attends to some part of the input and the computation is much cheaper. The drawback is that how to do the sampling, and what form the sampling distribution takes, need careful thought. Clearly the two are complementary. Moreover, the experience of the 1998 work shows that a purely generative task can sometimes perform poorly yet be improved with the help of a discriminative task. Hence some current work aims to combine soft + hard, for example the Wake-Sleep line of work.
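A toy numpy sketch of that contrast (my own illustration, not from any of the papers above): soft attention takes a differentiable weighted average over every location, while hard attention samples a single location, which is cheaper to read downstream but needs a REINFORCE-style estimator for its gradient because the sampling step is not differentiable.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def soft_attention(values, scores):
    """Differentiable: a weighted average over ALL input locations."""
    w = softmax(scores)
    return w @ values                       # gradient flows through w

def hard_attention(values, scores):
    """Stochastic: sample ONE location from the same distribution."""
    w = softmax(scores)
    idx = rng.choice(len(values), p=w)      # non-differentiable step
    return values[idx], idx

values = rng.normal(size=(10, 8))           # 10 locations, 8-dim features
scores = rng.normal(size=10)
print(soft_attention(values, scores).shape) # (8,)
print(hard_attention(values, scores)[1])    # index of the sampled location
```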


For more work on variational autoencoders for deep generative models, see the last part of Prof. Yoshua Bengio's tutorial at this NIPS 2015.


Finally, an excerpt from someone's notes; it doesn't contain much that is new:

Explicit Attention

Explicit attention can be helpful because it limits the data presented to the network in some way: it can be more computationally efficient, more scalable (the same amount of input even when the input is larger), it doesn't have to learn to ignore things, and it can also be used for sequential processing of static data (you can turn anything into a sequence, which comes full circle from when people used to turn sequential data into fixed-length vectors and statistics). Using "hard" attention usually requires some kind of reinforcement learning technique, which is necessary in some settings. However, in other settings it is possible to use a "soft" attention which is differentiable, so that we can use end-to-end training. One possibility is to have the attention mechanism output the parameters of a probability distribution that decides which locations in the sequence to attend to (from Graves' 2013 paper). This allows for visualization of the alignment of the produced output to the input data. Alternatively, you can select by content, e.g. output a "key vector" which is compared to all the data using some similarity function, which is then used to compute a probability distribution over the input deciding what to attend to. For example, the input can be embedded in some space, and then the attention mechanism decides which embeddings are attended to. This can even be effective when a location-based attention would seem more natural, such as for "very sequential" data.
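The "key vector compared to all data" idea is the content-based addressing used by NTM-style models; here is a minimal numpy sketch (my own, with a cosine similarity and a sharpening parameter beta, both of which are assumptions for illustration):

```python
import numpy as np

def content_addressing(memory, key, beta=5.0):
    """Content-based attention: compare a key against every memory row.

    Cosine similarities are sharpened by beta and passed through a
    softmax, giving a probability distribution over rows that decides
    what to attend to.
    """
    sims = memory @ key / (
        np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8
    )
    scores = beta * sims
    weights = np.exp(scores - scores.max())
    return weights / weights.sum()

memory = np.random.randn(32, 20)                  # 32 embeddings, dim 20
key = memory[7] + 0.05 * np.random.randn(20)      # a noisy copy of row 7
weights = content_addressing(memory, key)
print(weights.argmax())                           # usually 7
```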

Introspective Attention

It's also useful to selectively attend to the network's internal state or memory, e.g. selectively writing rather than writing the data all at once and then deciding what to read (小S's note: e.g. NTM, DRAW, and the Impatient Writer). This may make it easier to build complex structures. In the Neural Turing Machine, this is done by first emitting an erase vector and then emitting an add vector. This can be effective for copying tasks where the number of copies is variable and it is important to keep track of how many copies have been produced. In a different example, the neural programmer-interpreter avoids having to learn from scratch how and what to attend to, by being told exactly what to attend to at each point in time. This allows training to happen faster, by explicitly giving the model the procedure it should use rather than just lots of training data.
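A minimal numpy sketch of that erase-then-add write, following the NTM write equations (attention weights w, erase vector e in [0, 1], add vector a; the toy values below are mine):

```python
import numpy as np

def ntm_write(memory, w, erase, add):
    """Selective write as in the Neural Turing Machine.

    Each row i is first partially erased in proportion to w[i] * erase,
    then the add vector is blended in with the same attention weight.
    Rows with w[i] close to 0 are left (almost) untouched.
    """
    memory = memory * (1.0 - np.outer(w, erase))   # erase step
    memory = memory + np.outer(w, add)             # add step
    return memory

M = np.zeros((8, 4))
w = np.zeros(8); w[2] = 1.0                        # attend only to row 2
e = np.ones(4)                                     # fully erase that row
a = np.array([1.0, 2.0, 3.0, 4.0])
print(ntm_write(M, w, e, a)[2])                    # -> [1. 2. 3. 4.]
```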

Visual Attention

In DRAW, a grid of Gaussian filters is used both to read and to write from the "canvas" image, which acts as a 2-dimensional memory. The centers of the Gaussians, their variance, and the "strength" of the focus are controlled over time. This allows image generation to happen iteratively, either sharpening an image step by step or starting from a blurry image and making it sharper. The resulting system is compositional: once it learns to generate one thing, it can generate an arbitrary number of them. Related are spatial transformer networks (小S's note: introduced in the DL Symposium share), which parameterize the sampling grid with affine transformations or a spline mapping, allowing a nonlinear warping that can un-distort input images. This works in more than two dimensions, too.
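A numpy sketch of the Gaussian read window described above (simplified from the DRAW read operation: in the paper the five attention parameters are emitted by the network at every step, whereas here they are passed in by hand):

```python
import numpy as np

def filterbank(grid_size, img_size, center, stride, sigma2):
    """An (N x img_size) bank of 1-D Gaussian filters, one row per grid point."""
    i = np.arange(grid_size)
    mu = center + (i - grid_size / 2.0 + 0.5) * stride        # filter centres
    a = np.arange(img_size)
    F = np.exp(-((a[None, :] - mu[:, None]) ** 2) / (2.0 * sigma2))
    return F / (F.sum(axis=1, keepdims=True) + 1e-8)          # normalise rows

def draw_read(image, gx, gy, stride, sigma2, gamma, N=12):
    """Extract an N x N glimpse: read = gamma * F_y @ image @ F_x^T."""
    B, A = image.shape
    Fx = filterbank(N, A, gx, stride, sigma2)
    Fy = filterbank(N, B, gy, stride, sigma2)
    return gamma * Fy @ image @ Fx.T

img = np.random.rand(28, 28)                 # e.g. an MNIST-sized canvas
glimpse = draw_read(img, gx=14.0, gy=14.0, stride=1.5, sigma2=1.0, gamma=1.0)
print(glimpse.shape)                         # (12, 12)
```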



 Sleep, Learning and Memory: Optimal Inference in the Prefrontal Cortex

This talk is mainly about progress on the cognitive side. Using rats as subjects, the authors recorded activity in the rat prefrontal cortex and used it to test a hypothesis that is quite similar to the claims made for today's memory networks: spontaneous activity is driven by a prior distribution, evoked activity by a posterior distribution, and as the animal goes through the task during wakefulness the two distributions converge. Through experiments and a range of statistical measures they validated this hypothesis, and from it inferred that activity in the rat prefrontal cortex really is inference-by-sampling. In my view this inference is the most important part, and it may help with designing learning policies for future memory networks.


 Dynamic Memory Networks for Natural Language Processing

This is the "Ask Me Anything" paper, which went up on arXiv quite early; papers at ACL 2015 were already citing it. It comes from Richard Socher's MetaMind team. The main idea is to use a dynamic memory network (DMN) framework for QA (or even for understanding natural language in general).


The framework consists of several modules and can be trained end-to-end. The core module is the Episodic Memory module, which performs iterative semantic + reasoning processing. The DMN first receives the raw input (the question) at the input module and produces a question representation, which is sent to the semantic memory module; the semantic memory module then passes the question representation, together with an explicit knowledge base (still only a design sketch), to the core Episodic Memory module. The Episodic Memory module first retrieves the facts and concepts involved in the question, and then reasons step by step to produce an answer representation. Since several facts and questions may be involved, an attention mechanism is used here as well. Finally, the Answer Module uses the received answer representation to generate the actual answer.
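A much-simplified numpy sketch of the episodic memory idea (my own illustration: the paper's attention gate uses a richer feature set and a GRU over the facts, while this sketch only keeps the multi-pass structure): in each pass, attention over the facts depends on both the question and the current memory, an episode is formed, and the memory is updated; several passes allow facts to be chained for multi-hop reasoning.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def episodic_memory(facts, question, n_passes=3):
    """Simplified episodic memory: repeated attention passes over facts.

    facts: (n_facts, d) encoded sentences; question: (d,) encoded question.
    Later passes can pick up facts that only become relevant after
    earlier ones, which is what enables multi-step reasoning.
    """
    memory = question.copy()
    for _ in range(n_passes):
        scores = facts @ question + facts @ memory   # crude relevance gate
        attn = softmax(scores)
        episode = attn @ facts                       # summary of this pass
        memory = 0.5 * memory + 0.5 * episode        # simple memory update
    return memory

facts = np.random.randn(6, 32)        # six encoded facts
question = np.random.randn(32)
answer_repr = episodic_memory(facts, question)
print(answer_repr.shape)              # (32,), passed to the answer module
```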


 Chess Q&A: Question Answering on Chess Games

This one pairs well with "Ask Me Anything" above. It comes from CMU and also contributes a dataset, this one about reasoning over chess games. The sub-question types it contains are very similar to FB20 (Facebook's 20 bAbI tasks): questions about numbers, about locations, about sequences of moves, and so on. One way the chess dataset may be better than FB20 is that it involves less common knowledge (a limited knowledge requirement). The previously mentioned ICLR 2016 submission "Reasoning in Vector Space: An Exploratory Study of Question Answering" has to predefine some common knowledge, e.g. that north is the opposite of south; in chess, that kind of knowledge is comparatively rare. So this dataset might well catch on too.


  Towards Neural Network-based Reasoning

This one comes from the Noah's Ark Lab team and also uses the FB20 dataset for reasoning. Like the ICLR 2016 submission mentioned above, "Reasoning in Vector Space: An Exploratory Study of Question Answering", they consider path finding and positional reasoning the hardest tasks in FB20. The Neural Reasoner they propose reaches nearly 98% accuracy on these two tasks (a big jump over the <60% of previous models). Not as high as the 100% of the ICLR 2016 submission, but still impressive.

On the model side, their main addition is a reasoning layer, which at a general level lets the facts and the question interact and then updates their representations (though this was not used in the experiments). If that were all, the experimental results would not be that good.

A more important change in this work is that, alongside the reasoning task objective, they add a reconstruction loss objective: the semantic representation we learn should not only reason well, it should (at least) be able to reconstruct the original question. I personally agree with this very much. Although this is an intuitive analysis, we can also find corresponding theory. The objective function of the rapidly developing Variational AutoEncoder (VAE) also contains a reconstruction error term, log p(x|z); the other two terms of the VAE objective can be merged and viewed as a regularization term dictated by the variational bound. Seen this way, the VAE also jointly optimizes a bound + reconstruction, and optimizing the bound is a transformed version of the task objective. So, personally, I do not think the clear performance gain the Neural Reasoner obtains from adding a reconstruction error is an isolated case.
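Written out, the analogy is between a weighted task-plus-reconstruction objective and the VAE bound (generic forms, my paraphrase; the trade-off weight λ is a hypothetical symbol, not taken from the paper):

```latex
\mathcal{L}_{\text{Neural Reasoner}}
  = \mathcal{L}_{\text{reasoning}} + \lambda\, \mathcal{L}_{\text{reconstruction}},
\qquad
\log p(x) \;\ge\;
\underbrace{\mathbb{E}_{q(z \mid x)}\!\bigl[\log p(x \mid z)\bigr]}_{\text{reconstruction}}
\;-\;
\underbrace{D_{\mathrm{KL}}\!\bigl(q(z \mid x)\,\|\,p(z)\bigr)}_{\text{regularizer from the bound}}
```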

Finally, the roles played by the various errors/losses in neural networks are nicely discussed in the ICLR 2016 paper "Stacked What-Where Auto-Encoders" (SWWAE). Prof. Yoshua Bengio also recommended the SWWAE work in his NIPS 2015 tutorial.

Neural Machine Translation: Progress Report and Beyond

A presentation by Kyunghyun Cho, newly appointed Assistant Professor at NYU. The slides were posted online a while ago (from an ACL workshop); search for the title. It focuses on the neural-based MT work (encoder-decoder) of 2013-2015. At first people used the RNN encoder-decoder for MT and found it worked well most of the time, but it fell apart on long sentences. The root cause is that an RNN's capacity for representing long information (at least a vanilla RNN's) is not good enough. This led to the famous attention-based MT paper "Neural Machine Translation by Jointly Learning to Align and Translate". With the attention mechanism, MT performance improved a lot, but another problem remained unsolved: the computational cost was still large and decoding too slow. Out of this came the 2015 paper "On Using Very Large Target Vocabulary for Neural Machine Translation", which approximates large-vocabulary MT with a partial vocabulary. Beyond that, to improve MT further you have to turn to the training data. Parallel sentence-pair corpora for training MT are always limited, whereas monolingual corpora are enormous. A natural idea is to bring the knowledge in monolingual corpora into MT, letting language-model learning on monolingual data help the MT system's own language modelling; that is the work "On Using Monolingual Corpora in Neural Machine Translation". By this point, neural-based MT can already rival the older phrase-based MT. What's next? Go read the slides yourself :)
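For reference, the attention mechanism from "Jointly Learning to Align and Translate" computes, at each decoder step t, a context vector as a softmax-weighted sum of the encoder annotations h_j, where s_{t-1} is the previous decoder state and a(·) is a small feed-forward scoring network:

```latex
e_{tj} = a(s_{t-1}, h_j), \qquad
\alpha_{tj} = \frac{\exp(e_{tj})}{\sum_{k} \exp(e_{tk})}, \qquad
c_t = \sum_{j} \alpha_{tj} h_j
```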


 A Roadmap towards Machine Intelligence

An arXiv preprint from a while back by Mikolov, the "founding father" of word embeddings. My comment on Weibo at the time: not a single formula in the whole thing, you can read it as science fiction... As an SF fan, I read it... and read it... and read it... and was quite disappointed. Never mind whether it works as a story; the content honestly does not feel new. If I had to characterize it, it is roughly a compilation of lectures from assorted Artificial Intelligence courses, and of the Introduction to Artificial Intelligence kind at that. Since it left me cold, I don't know how to recommend it... If you have a different opinion, feel free to use the comment button at the bottom right of this article, so others can see it too and we can exchange ideas.

And with that, the NIPS 2015 RAM workshop share is complete. Meanwhile, the NIPS 2015 main conference has also come to a close, and a new journey begins... Papers from the main conference will be shared next time. However! However! The CV extravaganza ICCV 2015 has just started. So, [big preview] [big preview] [big preview]: on Wednesday 大S will share the highlights of ICCV 2015, and the NIPS posts will be pushed back accordingly. Thanks everyone for all the discussion and the tips! ~\(≧▽≦)/~

Other related articles: reply with the code (e.g. [GH034]) or click the article title to jump to it (links already added):

GH022 Applied Attention-Based Models (Part 2)

GH030 A First Look at Multi-modal Deep Learning

GH032 ICLR 2016 Submission Highlights

GH034 NIPS 2015 Deep Learning Symposium (Part 1)

GH035 NIPS 2015 Deep Learning Symposium (Part 2)
