DeepMind Lecture: Memory and Attention in Deep Learning. The History of Attention Mechanisms, Explained in Detail

DeepMind x UCL | Deep Learning Lectures | 8/12 | Attention and Memory in Deep Learning (machine-translated transcript)



Hello, and welcome to the UCL x DeepMind lecture series.


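So the network output I’ve chosen matters here. I should say that the task is for the network to transcribe this handwriting, these online pen positions, and to recognise what it was that the person wrote. You can see the output sequence here emitting the label decisions ‘o’, ‘n’, ‘c’, ‘e’, then the space character, and it misses out the ‘v’, so it doesn’t transcribe the input entirely correctly.
But the point we are looking at is the point where it decides to output the letter ‘i’ in ‘having’, and what’s really interesting is that, if we look at the sequential Jacobian, we can see a peak of sensitivity here that roughly corresponds to the point in the input sequence where the stroke, the main body of the letter ‘i’, was actually written.
I believe the reason for this is that the suffix ‘ing’, the ‘ing’ at the end of ‘having’, is a very common one.
So being able to identify that whole suffix helps you to disambiguate the letter ‘i’; it tells you, for example, that it’s not an ‘l’ in there.
And what’s really interesting is this peak, this very sharp peak right at the end, which corresponds to the point when the writer lifted the pen off the page, off the whiteboard, and went back to dot the ‘i’.
And of course that dot is crucial to recognising an ‘i’, right? That’s the thing that really distinguishes an ‘i’ from an ‘l’.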


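So the way this model generally works is that, given some attention outputs, we define a probability distribution over glimpses ‘g’ of the data ‘x’.
So I take the attention vector to be ‘a’ and use it to parameterise something like the probability of glimpse ‘g’ given ‘a’.
So the simplest case here is that we split the image into tiles. In the image on the right here you can see there are nine possible tiles, and ‘a’ simply assigns a probability to each of these tiles, that is, to the glimpse that uses that tile.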


$$
\begin{array}{c}
\pi_{\mathbf{a}}=\operatorname{Pr}\left(\mathbf{g}_{k} \mid \mathbf{a}\right) \\
R=\mathbb{E}_{\mathbf{g} \sim \pi_{\mathbf{a}}}\left[\log \pi_{\mathbf{a}} L(\mathbf{g})\right] \\
\nabla_{\mathbf{a}} R=\mathbb{E}_{\mathbf{g} \sim \pi_{\mathbf{a}}}\left[\nabla_{\mathbf{a}} \log \pi_{\mathbf{a}} L(\mathbf{g})\right]
\end{array}
$$
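The gradient line above has the shape of the score-function (REINFORCE) estimator. As a minimal sketch, the Python below assumes a softmax over nine tile logits and takes the quantity being differentiated to be the expected loss E[L(g)]; both are assumptions of the sketch, and the loss values and the names `score_function_grad` and `sampled_grad` are made up for illustration. It computes E[L(g) ∇ log π(g)] exactly by enumerating the tiles, and also by sampling glimpses, which is what one would do when enumeration is not possible.

```python
import math
import random

def softmax(logits):
    """Turn attention outputs (logits) into one probability per tile."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_loss(logits, losses):
    """E_{g ~ pi_a}[L(g)], enumerated exactly over all tiles."""
    return sum(p * l for p, l in zip(softmax(logits), losses))

def score_function_grad(logits, losses):
    """Exact E_{g ~ pi_a}[L(g) * grad_a log pi_a(g)] over all tiles."""
    pi = softmax(logits)
    grad = [0.0] * len(logits)
    for k, (p_k, loss_k) in enumerate(zip(pi, losses)):
        for j in range(len(logits)):
            # d/da_j log softmax(a)_k = [j == k] - pi_j
            dlogp = (1.0 if j == k else 0.0) - pi[j]
            grad[j] += p_k * loss_k * dlogp
    return grad

def sampled_grad(logits, losses, num_samples, rng):
    """Monte Carlo version: sample glimpses instead of enumerating them."""
    pi = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(num_samples):
        k = rng.choices(range(len(pi)), weights=pi)[0]
        for j in range(len(logits)):
            dlogp = (1.0 if j == k else 0.0) - pi[j]
            grad[j] += losses[k] * dlogp / num_samples
    return grad

# Nine tiles, as in the 3x3 example; the losses are invented for illustration.
logits = [0.1 * i for i in range(9)]
losses = [1.0, 0.5, 2.0, 0.2, 0.1, 0.9, 1.5, 0.3, 0.7]
exact = score_function_grad(logits, losses)
estimate = sampled_grad(logits, losses, 20000, random.Random(0))
```

The sampled estimate is noisy but needs only the loss of the glimpses that were actually drawn, which is why this form is used when the glimpse is a hard, non-differentiable choice.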




Once again, this is about being able to ignore distractors, being able to ignore noise.


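And because this is a weighted sum rather than a sample, the whole thing is directly differentiable with respect to the attention parameters ‘a’, as long as the attention distribution itself is differentiable, which it usually is.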


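But anyway, if we look at this weighted sum, this attention readout ‘v’, and stop thinking about it in probabilistic terms, and just think of it as a sum over ‘i’ of some weight ‘w[i]’ times some vector ‘x[i]’, this should look familiar to you: it’s really just an ordinary summation, a sigma unit from an ordinary neural network.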
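As a minimal sketch of this weighted-sum readout, the Python below computes v as the sum over i of w[i] times x[i]. The softmax used to produce the weights and the function names are assumptions for illustration; the text only requires that the weights come from a differentiable attention distribution.

```python
import math

def softmax(scores):
    """Normalise arbitrary scores into attention weights that sum to one."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_read(weights, vectors):
    """Attention readout v = sum_i w[i] * x[i]."""
    dim = len(vectors[0])
    return [sum(w * x[d] for w, x in zip(weights, vectors)) for d in range(dim)]

vectors = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
uniform = softmax([0.0, 0.0, 0.0])      # equal attention everywhere
v = soft_read(uniform, vectors)         # the plain average of the three vectors
picked = soft_read([0.0, 0.0, 1.0], vectors)  # all weight on the last vector
```

Every operation here is a sum or a product, so gradients flow to the weights without any sampling.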


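So, for example, I want a mechanism that can look at the letter ‘h’ in the input sequence and use that as the conditioning signal while it draws a letter ‘h’, then move on to the letter ‘a’, and so on.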


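So you can think of that as the ‘h’ here and the ‘a’ here, and so on, and the network decides where to put these Gaussians. Once we perform this summation at the top here, that implicitly determines the attention weights: which part of the text sequence we should look at in order to generate the output distribution.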
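A minimal sketch of such a Gaussian window over a text sequence is given below. It assumes each Gaussian has a height `alpha`, a sharpness `beta` and a centre `kappa` measured in character positions, and that the text is one-hot encoded; these parameter names and the encoding are assumptions for illustration, not details taken from the passage.

```python
import math

def window_weights(alphas, betas, kappas, text_length):
    """phi[u] = sum_k alpha_k * exp(-beta_k * (kappa_k - u)^2)."""
    return [
        sum(a * math.exp(-b * (k - u) ** 2)
            for a, b, k in zip(alphas, betas, kappas))
        for u in range(text_length)
    ]

def read_window(phi, one_hot_text):
    """Weighted sum of the character vectors under the window."""
    dim = len(one_hot_text[0])
    return [sum(p * c[d] for p, c in zip(phi, one_hot_text)) for d in range(dim)]

text = "ha"
alphabet = sorted(set(text))  # ['a', 'h']
one_hot = [[1.0 if ch == sym else 0.0 for sym in alphabet] for ch in text]

# One Gaussian centred on position 0, that is, on the letter 'h'.
phi = window_weights(alphas=[1.0], betas=[4.0], kappas=[0.0],
                     text_length=len(text))
w = read_window(phi, one_hot)  # dominated by the 'h' component
```

Moving `kappa` towards 1.0 slides the window onto the ‘a’, which is how the conditioning signal would advance along the text.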


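And the task we have here is ‘entity 1 to 9 identifies deceased sailor as X’, and what the network has to do is fill in X. You can see from the heat map here which words it attends to when it is trying to fill in X.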


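It knows that when it has to start emitting, for example, the ‘st’ at the start of the sentence, it should focus very strongly on these sounds at the beginning that correspond to those noises in the speech signal.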


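The way this works is that the network looks at the previous weights and outputs a shift kernel, which is just a softmax over the numbers between plus and minus ‘n’.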
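One way to read this is as a circular convolution of the previous attention weights with the shift kernel. The sketch below assumes exactly that, with one logit per shift from -n to +n and wrap-around at the ends of memory; the wrap-around and the function names are assumptions for illustration.

```python
import math

def softmax(scores):
    """Normalise shift logits into a kernel that sums to one."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def shift_weights(prev_weights, shift_logits):
    """Circularly convolve the previous attention weights with a shift kernel.

    shift_logits has 2n + 1 entries, for shifts -n, ..., 0, ..., +n.
    """
    n = (len(shift_logits) - 1) // 2
    kernel = softmax(shift_logits)
    size = len(prev_weights)
    shifted = [0.0] * size
    for i in range(size):
        for offset, s in zip(range(-n, n + 1), kernel):
            # Weight that sat at position i - offset moves to position i.
            shifted[i] += s * prev_weights[(i - offset) % size]
    return shifted

prev = [0.0, 1.0, 0.0, 0.0, 0.0]
# A kernel that puts almost all of its mass on shift +1.
moved = shift_weights(prev, [-20.0, -20.0, 20.0])
```

A sharply peaked kernel moves the focus by a whole step, as here; a flatter kernel would blur the focus across neighbouring locations.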


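So in this case we were inspired by the way that long short-term memory, LSTM, has forget and input gates that are able to modify the contents of its memory, of its internal state. So we defined an erase vector ‘e’, which acts like the forget gate in long short-term memory, and an add vector, which acts like the input gate.
Essentially what happens is that once the write head has determined which rows in the matrix it is attending to, the contents of those rows are selectively erased according to ‘e’. I should say here that ‘e’ is basically a set of numbers between zero and one.
And if ‘e’ is set to zero, the memory matrix is left as it is.
As for the write weights, the important thing here is that if ‘w[i]’ is low for a row of the matrix, then nothing happens to that row, nothing changes, so if you are not attending to that part of memory, you do not modify it either.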
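As a minimal sketch of this erase-then-add write, the Python below updates each memory row as M[i] * (1 - w[i] * e) + w[i] * a. The add vector `a` and the function name `write_memory` are assumptions for illustration, with `a` playing the input-gate role mentioned above.

```python
def write_memory(memory, weights, erase, add):
    """Apply M[i] <- M[i] * (1 - w[i] * e) + w[i] * a to every row i."""
    return [
        [m * (1.0 - w * e) + w * a for m, e, a in zip(row, erase, add)]
        for row, w in zip(memory, weights)
    ]

memory = [[1.0, 2.0], [3.0, 4.0]]
weights = [1.0, 0.0]   # attend fully to row 0, not at all to row 1
erase = [1.0, 0.0]     # wipe the first component, keep the second
add = [0.5, 0.5]
updated = write_memory(memory, weights, erase, add)
# Row 0 becomes [0.5, 2.5]; row 1 is untouched because its weight is zero.
```

Setting `erase` to all zeros leaves the old contents in place and only adds to them, and a zero weight leaves a row unchanged whatever `erase` and `add` contain.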


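So a kind of pseudocode version of the algorithm it uses is given on the right, but we can also analyse it by looking at its use of attention and the way it attends to particular locations in memory during this task.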


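We were able to ask questions such as: can you find the shortest path between Moorgate and Piccadilly Circus? Or: can you perform a traversal when you start at Oxford Circus and follow the Central line, the Circle line, and so on?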


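So for this word ‘making’ that we are looking at here, while this word is being processed (I forget what the exact task was here), it attends to a number of different words: ‘laws’, ‘2009’, the word ‘making’ itself, but also this phrase ‘more difficult’ at the end of the sequence.
So all of these things are tied up with the semantics of how the word ‘making’ is used in the sentence.
It seems to be looking for phrases, so the word ‘what’ attends to ‘this is what’, the word ‘law’ attends to ‘the law’ and ‘it is’, and so on.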


And since then it has become more and more powerful: it now provides the state of the art for language modelling.


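Once you have seen ‘p’, ‘e’, ‘o’, ‘p’, ‘l’, it is easy to predict that an ‘e’ is coming.
Once you see a space after that ‘e’, it becomes much harder.
So it is not just that the network thinks for longer whenever it finds something hard to predict; it thinks for longer when it believes that thinking for longer is likely to let it make a better prediction.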


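So this idea of adaptive computation time fits together nicely with these Universal Transformer models. In this case we have a task from the bAbI dataset in which there is a series of sentences; the sentences shown along the x-axis are the inputs to the network, the context that the network needs to know about, and then it gets asked a question.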




Hello and welcome to the UCL x DeepMind lecture series.
My name is Alex Graves.
I’m a research scientist at DeepMind.
And I’m going to be talking to you today about attention and memory in deep learning.
So you may have heard people talk about attention in neural networks, and it’s really emerged over the last few years as an exciting new component in the deep learning toolkit.
In my opinion, it’s one of the last new things that’s been added to our toolbox.
So in this lecture we’re going to explain how attention works in deep learning.
And we’re also going to talk about the linked concept of memory.
And so you can think of memory in some sense as attention through time.
And so we’re going to talk about a range of attention mechanisms, those that are implicitly present in any deep network as well as more explicitly defined attention.
And then we’ll talk about external memory and what happens when you have attention to that and how that provides you with selective recall.
And then we’ll talk about Transformers and Variable Computation Time.
So I think the first thing to say about attention is that it’s not only something that’s useful for deep learning; it plays a vital part in human cognition.
So the ability to focus on one thing and ignore others is really vital.
And so we can see this in our everyday lives.
We’re constantly bombarded with sensory information coming from all directions and we need to be able to pick out certain elements of that signal in order to be able to concentrate on them.
So a classical example of this is known as the cocktail party problem: when people are attending a noisy party and listening to lots of other people talking at once, they’re still able to easily pick out one particular speaker and let the others fade into the background, and this is what allows them to hear what that speaker is saying.
But there’s also a form of introspective or internal attention that allows us to attend to one thought at a time, to remember one event rather than all events.
And the crucial idea that I want you to take away from this is that attention is all about ignoring things. It’s not about putting more information into a neural network; it’s actually about removing some of the information so that it’s possible to focus on specific parts.
Now I know you’ve all heard about neural networks and how they work and it might seem at first glance that there’s nothing about a neural network that is particularly related to this notion of attention.
So we have this, you know, this big non-linear function approximator that takes vectors in and gives vectors out and so in this kind of paradigmatic example, you have an image coming in, being processed and then a classification decision coming out.
Is it a leopard or a jaguar or a cheetah in this image? And this doesn’t appear to have much to do with attention at first glance, the whole image is presented to the network and then a single decision is made.
But, what you can actually find if you look inside neural networks and analyse what they’re actually doing with the data is that they already learn a form of implicit attention, meaning that they respond more strongly to some parts of the data than others.
And this is really crucial.
So if you want to distinguish a leopard, for example, from a tiger or something like that, part of what you need to focus on is the spots in the leopard’s fur, and you need to focus on these parts while ignoring perhaps irrelevant detail in the background.
And to a first approximation we can study this use of implicit attention by looking at the network Jacobian.
So the Jacobian is basically the sensitivity of the network outputs with respect to the inputs.
So mathematically it’s really just a matrix of partial derivatives where each element Jᵢⱼ is the partial derivative of some output unit, i, with respect to some input unit, j, and you can compute this thing with ordinary backprop.
So basically the backprop calculation that’s used for gradient descent can be repurposed to analyse the sensitivity of the network.
All you do is, instead of passing through the errors with respect to some loss function, you set the errors equal to the output activations themselves and then you perform backprop.
And by doing this, we get a feel for which pieces of information the network is really focusing on, what it’s using in order to solve a particular task.
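As a minimal sketch of this idea, the Python below runs backprop through a tiny two-layer tanh network. The network, its weights and the function names are invented for illustration. Feeding a one-hot vector as the output "errors" recovers one row of the Jacobian, and feeding the output activations themselves gives the sensitivity signal described above, which is the Jacobian transposed times the outputs.

```python
import math

def forward(x, W1, W2):
    """Tiny two-layer network: y = W2 @ tanh(W1 @ x)."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    y = [sum(w * hi for w, hi in zip(row, h)) for row in W2]
    return h, y

def backprop_to_input(x, W1, W2, output_errors):
    """Push a vector of output 'errors' back to the inputs: J^T @ errors."""
    h, _ = forward(x, W1, W2)
    # Error arriving at each hidden unit, times the tanh derivative (1 - h^2).
    dh = [
        (1.0 - h[k] ** 2)
        * sum(output_errors[i] * W2[i][k] for i in range(len(W2)))
        for k in range(len(h))
    ]
    return [sum(dh[k] * W1[k][j] for k in range(len(h))) for j in range(len(x))]

def jacobian(x, W1, W2):
    """Row r is the backprop result for a one-hot error on output r."""
    n_out = len(W2)
    return [
        backprop_to_input(x, W1, W2,
                          [1.0 if i == r else 0.0 for i in range(n_out)])
        for r in range(n_out)
    ]

W1 = [[0.5, -0.3, 0.8], [0.1, 0.7, -0.2]]
W2 = [[1.0, -1.0], [0.4, 0.6]]
x = [0.2, -0.5, 1.0]

J = jacobian(x, W1, W2)                        # J[i][j] = d y_i / d x_j
_, y = forward(x, W1, W2)
sensitivity = backprop_to_input(x, W1, W2, y)  # errors set to the outputs
```

Taking the magnitude of each entry of `sensitivity` gives one number per input, which is the kind of quantity a heat map over the input can display.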
So by way of illustration, here’s a neural network known as the Dueling Network.
This is an architecture presented in 2015 that was used for reinforcement learning.
It’s a network that was applied to playing Atari games, so the input is a video sequence, and in this case the network has a two-headed output.
One head attempts to predict the value of the state, as is kind of normal for reinforcement learning, for deep reinforcement learning.
The other head attempts to predict the action advantage, which is basically the difference between the value given a particular action and the expected value overall.
Or to put it in simpler terms, it tries to guess whether performing a particular action will make its value higher or lower.
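As a small worked example of that definition, the sketch below computes the advantage of each action as its value minus the expected value over actions. The action values, the policy probabilities and the function names are made up for illustration and are not taken from the Dueling Network itself.

```python
def state_value(q_values, policy):
    """V(s) = sum_a pi(a|s) * Q(s, a): the expected value over actions."""
    return sum(p * q for p, q in zip(policy, q_values))

def advantages(q_values, policy):
    """A(s, a) = Q(s, a) - V(s): positive if the action beats expectations."""
    v = state_value(q_values, policy)
    return [q - v for q in q_values]

q_values = [1.0, 3.0, 2.0]   # invented values for left / right / straight
policy = [0.25, 0.5, 0.25]
adv = advantages(q_values, policy)
# V(s) is 2.25 here, so the advantages are -1.25, 0.75 and -0.25.
```

By construction the advantages average to zero under the policy, so they only say which actions are better or worse than usual, not how good the state is.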
And so if we look at the video here, this image on the left represents the Jacobian with respect to the value prediction. What’s being shown here is the input video itself (this is a racing game where the goal is to try and overtake as many cars as possible without crashing), and overlaid on that is this red heatmap that we see flaring up. This is the Jacobian, so the places that are appearing in red are the places that the network is sensitive to.
So if we concentrate on the left side of this video we can see some things that the network is really interested in.
So one of them is that it tends to focus on the horizon, where the cars are just appearing on the screen.
And of course these are very important as a predictor of how much score the network is likely to obtain in the near future because it’s by overtaking these cars that it gets points.
It’s also continually focused on the car itself, and obviously that’s important because it needs to know its own state in order to predict its value.
And interestingly, it has another area of, kind of, continual focus, which is the score at the bottom.
So because it’s the score that it’s attempting to predict (the score is the value for these games), it makes sense that knowing what the current score is is very important.
That’s what gives it an indicator of how fast the value is accumulating.
If we look at the image on the right, which is also a Jacobian plot, but this time it’s the Jacobian of this action advantage, so the degree to which any one particular action’s better or worse than the expectation over other actions, we see a very different picture.
First of all, we see that there’s less sensitivity overall.
These red areas of sensitivity in the Jacobian are a lot less prevalent, and when they do show up, they tend to show up in different places.
They’re not looking so much at the horizon, they’re not looking at the score very much, they tend to flare up just in front of the car that’s driving.
And the reason for that is that the information it needs to decide whether it’s better to go right or left is really the information about the cars that are very close to it.
So that’s the point.
It’s only really when it comes close to another car that it has this critical decision about whether it should go right or left.
And so what I’m trying to get across with this video, is that even for the same data, you get a very different sensitivity pattern depending on which task you’re trying to perform.
And so this implicit attention mechanism is allowing it to process the same data in two very different ways, it’s seeing essentially, even though it’s being presented with the same data, it’s effectively seeing different things and seeing these different things is what allows it to perform different tasks.
So once again, the whole point about attention and the whole reason it’s so important is that it allows you to ignore some parts of the data and focus on others.
And this same concept also applies to recurrent neural networks, which I think you’ve covered in an earlier lecture. The idea here is that you’ve got a network that takes sequences as inputs and produces sequences as outputs.
And what really makes recurrent neural networks interesting is that they have these feedback connections that give them some kind of memory of previous inputs.
And what we really want to know, as I said at the start of the lecture, memory can be thought of as attention through time.
So what we really want to know about recurrent neural networks is how are they using the memory to solve the task? And once again, we can appeal to the Jacobian to try to measure this use of memory, this use of past information or surrounding context.
And in this case, I tend to refer to it as a sequential Jacobian because what you’re really doing now, instead of getting a two dimensional matrix of partial derivatives, you’re really looking at a three dimensional matrix where the third dimension is through time.
And what you care about mostly is how sensitive the decisions made by the network at one particular time are to the inputs at other times.
In other words, what part of the sequence does it have to remember, does it have to recall, in order to solve the task.
Okay, so to make that a little bit more concrete: the sequential Jacobian is a set of derivatives of one network output, so one output at one particular point in time, with respect to all the inputs over time.
So there’s a time series, a sequence of these 2D Jacobian matrices.
And what you can use this sequential Jacobian to analyse is how the network responds to inputs in the sequence that are related, in the sense that they are needed together in order to solve a particular aspect of the task, but are not necessarily contiguous or close to one another in the input sequence; they may be widely separated.
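As a minimal sketch of a sequential Jacobian, the Python below uses a scalar recurrent network, h_t = tanh(w * h_{t-1} + u * x_t), and computes the derivative of the output at one time step with respect to the input at every time step. The network, its parameters and the function names are assumptions for illustration. Because this toy network runs in one direction only, its sensitivity to inputs after the chosen time step is exactly zero, which would not hold for a network that also sees future context.

```python
import math

def run_rnn(xs, w, u):
    """Scalar RNN: h_t = tanh(w * h_{t-1} + u * x_t), with output y_t = h_t."""
    hs, h = [], 0.0
    for x in xs:
        h = math.tanh(w * h + u * x)
        hs.append(h)
    return hs

def sequential_jacobian(xs, w, u, t):
    """Return d y_t / d x_s for every input time s."""
    hs = run_rnn(xs, w, u)
    sens = [0.0] * len(xs)
    grad = 1.0                       # d y_t / d h_t
    for s in range(t, -1, -1):
        local = 1.0 - hs[s] ** 2     # tanh derivative at step s
        sens[s] = grad * local * u   # direct effect of x_s on h_s
        grad = grad * local * w      # carry the gradient back to h_{s-1}
    return sens

xs = [0.5, -1.0, 0.3, 0.8, -0.2]
sens = sequential_jacobian(xs, w=0.9, u=0.7, t=3)
```

Plotting the absolute values in `sens` against the time index gives a one-dimensional version of the sensitivity curves discussed in this part of the lecture.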
And so in the example I’ve got here, this was from a network that I worked on some years ago that was trained to do online handwriting recognition.
So online handwriting recognition means that someone is, in this case, writing on a white board with a pen that has an infrared tracker that keeps a track of the location of the pen and is therefore able to record a trajectory of pen positions.
And it also records special end of stroke markers for when the pen is lifted off the whiteboard.
And so this text at the bottom shows that the words the person wrote were ‘once having’, and the next graph up from the bottom shows how the information was actually presented to the network: what the network actually saw was a series of these X and Y coordinates, with these end-of-stroke spikes.
And then above that what we have is the sequential Jacobian. What I’m really interested in here is the magnitude of the sequential Jacobian, of all these matrices over time: essentially the magnitude of the response of one particular network output with respect to the inputs at a particular time.
And so, on the network output that I’ve chosen, I should say first that the task here is for the network to transcribe these online pen positions and to recognise what it was that the person wrote. You can see there’s this output sequence here where it’s emitting label decisions: ‘o’, ‘n’, ‘c’, ‘e’, then the space character. It misses out the ‘v’, so in this case it doesn’t entirely transcribe the input correctly.
But the point that we are looking at is the point where it decides to output the letter ‘i’ in ‘having’, and what’s really interesting, if we look at the sequential Jacobian, we can see that there’s a peak of sensitivity around here, which roughly corresponds to the point in the input sequence where the stroke, the main body of the letter ‘i’ was actually written.
So it makes sense that there’s a peak of sensitivity here.
However, we can see that the sensitivity also extends further on in the sequence.
It doesn’t extend so far back in the sequence, only very slightly.
So the sensitivity is mostly to what comes afterwards, towards the end.
And I believe the reason for this is that this suffix, ‘ing’, the ‘ing’ at the end of ‘having’ is a very common one.
And so being able to identify that whole suffix helps you to disambiguate the letter, ‘i’, it helps to tell you, for example, that it’s not an ‘l’ in there.
And what’s really interesting is this peak, this very sharp peak right at the end, and what that corresponds to is the point when the writer lifted the pen off the page, off the white board, and went back to dot the ‘i’.
So they wrote this entire word ‘having’ as one continuous stroke in their cursive handwriting and then they lifted the pen off the page and put a little dot there.
And of course that dot is crucial to recognising an ‘i’, right? That’s the thing that really distinguishes an ‘i’ from an ‘l’.
So again, it makes sense that the network is particularly sensitive to that point, but it’s nice to see that by analysing the sequential Jacobian you can really get a sort of quantifiable sense of the degree to which it’s using particular pieces of information.
And once again, I want to stress what’s really critical here is that means it’s ignoring other pieces of information.
It’s focusing on those parts of the sequence that are relevant and ignoring those that are irrelevant.
And you know, we can see that this is really quite powerful.
It’s able to bridge things that are related in the input sequence but may actually be quite far apart.
Another example here comes from machine translation.
Now, a major challenge in machine translation is that words may appear in a completely different order in a different language and so we have a simple example here where we have this infinitive ‘to reach’ at the start of an English sentence that’s being translated into German.
But in German, the corresponding verb appears at the end of the sentence.
And so in order to correctly translate this, the network needs to be able to reorder the information. What this paper from 2016 showed was that, just with a very deep network, without any kind of specific mechanism for rearrangement or for attention, the network was able to use its implicit attention to perform this rearrangement.
And so what we’re seeing in the heat map on the right here is again, this idea of sensitivity, it’s a sensitivity map of the outputs at particular points in the target sequence, so in the German sequence, with respect to the inputs in the English sequence.
And you can see mostly there’s a kind of diagonal line because in this particular case, most of the sequence, most of the words have a more or less direct sort of one-to-one translation.
But there’s this part at the end of the sequence where the final two words in German are particularly sensitive to the words at the start in English.
So for this word ‘reach’, there’s a peak of sensitivity from the end of the sequence.
And of course this is once again showing that the network is able to use this implicit attention that it gets in some sense for free just by being a very deep network, by being a very, you know, rich function approximator, it’s able to use that to focus in on a particular part of the sequence and to ignore the rest of the sequence.
Well, you know, implicit attention is great, but there are still reasons to believe that having an explicit attention mechanism might be a good thing.
So what I mean by an explicit attention mechanism is one where you actually decide to only present some of the data to the network and you know, completely remove other parts of the data.
And one reason this might be preferred, of course, is computational efficiency.
So you no longer have to process all of the data, you don’t have to feed it to the network at all, so you can save some compute.
There’s a notion of scalability.
So for example, if you’ve got a fixed size, what I’ll call a ‘glimpse’ or like a foveation, where you take in a fixed size part of an image, then you can scale to any sized image so that the resolution of the input doesn’t have to sort of alter the architecture of the network.
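As a minimal sketch of a fixed-size glimpse, the Python below cuts a square patch of a given size out of an image of any size. The zero padding at the borders and the name `extract_glimpse` are assumptions for illustration. The point it shows is that the patch handed to the network has the same shape whatever the image resolution.

```python
def extract_glimpse(image, center_row, center_col, size):
    """Cut a size x size patch centred on (center_row, center_col).

    Pixels that fall outside the image are filled with 0.0.
    """
    half = size // 2
    height, width = len(image), len(image[0])
    patch = []
    for r in range(center_row - half, center_row - half + size):
        row = []
        for c in range(center_col - half, center_col - half + size):
            inside = 0 <= r < height and 0 <= c < width
            row.append(image[r][c] if inside else 0.0)
        patch.append(row)
    return patch

# Two images of different resolution, filled with distinct pixel values.
small = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
large = [[float(r * 50 + c) for c in range(50)] for r in range(30)]

g_small = extract_glimpse(small, 0, 0, 3)    # partly off the top-left edge
g_large = extract_glimpse(large, 15, 25, 3)  # fully inside the larger image
```

Both glimpses are 3 by 3, so a network that consumes them needs no change when the input image grows.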
There’s this notion of sequential processing of static data, which I think is an interesting topic.
So again, if you go look at a kind of visual example, if we have a foveal gaze moving around a static image, then what we get is a sequence of sensory input.
And of course this is how images are presented to the human eye.
We’re always actually, even if the data is static, we’re always actually receiving it as a sequence.
And there’s reasons to believe that doing this can improve the robustness of systems.
So, for example, there was a recent paper that showed that networks with sequences of glimpse or foveal attention mechanisms for static data were more robust to adversarial examples than ordinary convolutional networks that looked at the entire image in one go.
Last but not least, there’s a big advantage here in terms of interpretability.
So because explicit attention requires, you know, making a hard decision and choosing some part of the data to look at, you can analyse a little bit more clearly what it is that the network is actually using.
So, you know, with implicit attention we’ve looked at the Jacobian as a guide to what the network is looking at, but it really is only a guide.
It’s not necessarily an entirely reliable signal as to what the network is using and what it’s ignoring, whereas with explicit attention mechanisms, as we’ll see, you get a much clearer indication of the parts of the data that the network is actually focusing on.
So the basic framework for what I’m going to call neural attention models is that you have a neural network as usual that is producing an output vector as always, but it’s also producing an extra output vector that is used to parameterise an attention model.
So it gives some set of parameters that are fed into this attention model, which we’ll describe in a minute, and that model then operates on some data, whether that’s an image that you’re looking at or audio or text or whatever it is and gives you what I’m going to call a glimpse vector.
And this is non-standard terminology, I’m just using it because I think it helps to kind of unify these different models.
That glimpse vector’s then passed to the network as input at the next timestep.
And so there’s this kind of loop going on where the network makes a decision about what it wants to attend to, and that then influences the data it actually receives at the next step.
And what that means is that even if the network itself is feed forward, the complete system is recurrent.
It contains a loop.
So the, you know, the way this model usually works is that we define a probability distribution over glimpses, ‘g’, of the data, ‘x’, given some set of attention outputs.
So I’ve set this attention vector ‘a’ and that’s used to parameterise something like the probability of glimpse ‘g’ given ‘a’.
So the simplest case here is we just split the image into tiles. In this image on the right here you can see there’s nine possible tiles, and ‘a’ just assigns probabilities to a set of discrete glimpses, that is, to each of these tiles.
So it’s just a kind of good old fashioned softmax function here where the softmax outputs are the probabilities of picking each tile.
And so we can see that having done that, if we have a network that is using this distribution, what it’s going to do is, you know, output some distribution over these nine tiles and then at each point in time it’s going to receive one of the tiles as input.
So rather than receiving the whole input at once, it’s going to keep on looking at one tile at a time.
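As a rough illustration of the tile-based glimpse distribution just described, here is a minimal sketch in plain Python. The names `split_into_tiles` and `sample_glimpse`, the 6x6 toy image and the logit values are all assumptions made for this example, not anything from the lecture slides.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def split_into_tiles(image, rows=3, cols=3):
    """Split a 2-D list into rows * cols equally sized tiles (row-major order)."""
    tile_h = len(image) // rows
    tile_w = len(image[0]) // cols
    tiles = []
    for r in range(rows):
        for c in range(cols):
            band = image[r * tile_h:(r + 1) * tile_h]
            tiles.append([row[c * tile_w:(c + 1) * tile_w] for row in band])
    return tiles


def sample_glimpse(image, attention_logits, rng):
    """Hard attention: sample ONE tile with probability softmax(attention_logits)."""
    tiles = split_into_tiles(image)
    probs = softmax(attention_logits)
    index = rng.choices(range(len(tiles)), weights=probs, k=1)[0]
    return index, tiles[index], probs


# Toy 6x6 "image" whose pixel values are just their row-major position.
image = [[r * 6 + c for c in range(6)] for r in range(6)]
logits = [0.0] * 9
logits[4] = 5.0  # the (hypothetical) network strongly prefers the centre tile
index, glimpse, probs = sample_glimpse(image, logits, random.Random(0))
```

In a full system the sampled tile would be fed back to the network as its input at the next timestep, which is what closes the loop described earlier.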
Now one issue with this of course, is that it’s a hard decision.
And what I mean by hard decision is it’s no longer, we no longer have a complete gradient with respect to what the network has done.
Basically what we’ve got is a stochastic policy in reinforcement learning terms that we’re sampling from in order to get the glimpses.
And we can train this with something like REINFORCE.
So I’ve given the simple, standard mathematics here for how you get a gradient with respect to some stochastic, discrete sample using REINFORCE.
And this is a sort of general trick here.
We can use these sorts of what I’m going to call RL methods, by which I really just mean methods that are designed for getting a training signal through a discrete policy, and we can fall back on these for supervised tasks like image classification anytime there’s a non-differentiable module in there.
And what we can’t do is just ordinary end-to-end backprop.
And this is a significant difference between using kind of hard attention as I’ve described it so far versus using this implicit attention that’s always present in neural networks.
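The REINFORCE idea can be sketched for the softmax-over-tiles case as follows. This is only an illustration under stated assumptions: the per-tile `losses` are made-up numbers standing in for whatever loss the network would incur after seeing each tile, and the function names are hypothetical. The single-sample estimate is the loss of the sampled glimpse times the gradient of its log-probability; averaged over all possible samples it matches the gradient of the expected loss, which the sketch checks against finite differences.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_loss(logits, losses):
    """Expected loss under the glimpse distribution softmax(logits)."""
    return sum(p * l for p, l in zip(softmax(logits), losses))


def reinforce_gradient(logits, losses, sampled):
    """Single-sample estimate: loss(sampled) * d log p(sampled) / d logits.

    For a softmax, d log p_k / d logit_j = (1 if j == k else 0) - p_j.
    """
    probs = softmax(logits)
    return [losses[sampled] * ((1.0 if j == sampled else 0.0) - probs[j])
            for j in range(len(logits))]


def mean_reinforce_gradient(logits, losses):
    """Exact expectation of the estimator, enumerating every possible sample."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for k, p in enumerate(probs):
        estimate = reinforce_gradient(logits, losses, k)
        grad = [acc + p * g for acc, g in zip(grad, estimate)]
    return grad


def finite_difference_gradient(logits, losses, eps=1e-6):
    """Central-difference gradient of the expected loss, used only as a check."""
    grad = []
    for j in range(len(logits)):
        up, down = list(logits), list(logits)
        up[j] += eps
        down[j] -= eps
        grad.append((expected_loss(up, losses) - expected_loss(down, losses)) / (2 * eps))
    return grad


logits = [0.2, -0.5, 1.0]   # hypothetical attention outputs for three tiles
losses = [1.0, 3.0, 0.5]    # hypothetical loss incurred after seeing each tile
mean_grad = mean_reinforce_gradient(logits, losses)
fd_grad = finite_difference_gradient(logits, losses)
```

In practice you only ever see one sampled glimpse at a time, so the estimate is noisy, which is part of why end-to-end backprop is preferred when it is available.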
So generally we want to do something a little bit more complex than just a softmax over tiles.
One example that I’ve kind of already alluded to is this notion of a foveal model where you have a kind of multiresolution input that looks at the image, takes part of the image at high resolution, so in this case, the square in the centre here is kind of recorded at high resolution, it’s basically just mapped at one-to-one.
This next square out is also presented to the network but at a lower resolution.
So you can see the actual, it’s taking something that maybe has twice as many pixels as the one in the middle and subsampling it down to something with the same number of pixels.
And then the third square out looks at the entire image here that gives this very kind of squashed down low resolution version of it to the network.
And the idea is that you’re mimicking the effect of the human eye where it has high kind of, high resolution in the centre of your gaze and much lower resolution in the periphery with the idea being that the information at the periphery is sufficient to alert you to something that you should attend to more closely.
You should look at directly in order to get a higher resolution view of it.
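A multiresolution foveal glimpse of this kind might be sketched as below. The details are assumptions for illustration only: each level doubles the side of the window, windows that run off the image are zero-padded, and downsampling is done by average pooling. The 2014 paper may differ on all of these.

```python
def crop(image, centre_row, centre_col, size):
    """Square crop of side `size` around the centre; pixels outside the image read as 0."""
    top = centre_row - size // 2
    left = centre_col - size // 2
    height, width = len(image), len(image[0])
    return [[image[r][c] if 0 <= r < height and 0 <= c < width else 0.0
             for c in range(left, left + size)]
            for r in range(top, top + size)]


def downsample(patch, factor):
    """Average-pool non-overlapping factor x factor blocks."""
    n = len(patch) // factor
    return [[sum(patch[r * factor + i][c * factor + j]
                 for i in range(factor) for j in range(factor)) / factor ** 2
             for c in range(n)]
            for r in range(n)]


def foveal_glimpse(image, centre_row, centre_col, base=4, levels=3):
    """Return `levels` patches, all base x base; level k covers a (base * 2**k) window."""
    patches = []
    for k in range(levels):
        factor = 2 ** k
        window = crop(image, centre_row, centre_col, base * factor)
        patches.append(downsample(window, factor))
    return patches


# Toy 16x16 image with ones on the diagonal.
image = [[float(r == c) for c in range(16)] for r in range(16)]
glimpse = foveal_glimpse(image, centre_row=8, centre_col=8)
```

Every level has the same number of pixels, so the network input stays a fixed size no matter how large the original image is.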
And we can see an example of this applied to image classification.
This is from a 2014 paper where the network was given the cluttered MNIST data where these MNIST, these familiar MNIST handwritten digits are basically dropped in an image that has some visual clutter.
And the idea here is that in order to classify the image, the network has to discover the digit within the clutter.
Again, once again, it’s about being able to ignore distractors, being able to ignore the noise.
And the green path here shows the movement of this foveal model through the image over this kind of six point trajectory that it’s given while it classifies the image.
And we can see, for example, in this, on the example on the top row, it starts down here in the bottom corner where there isn’t much information but then rapidly moves towards the digit in the image and then kind of scans around the digit.
And in the pictures to the right, we can see the information that’s actually presented to the network.
Basically, you know, it starts off with something where there’s very little information about the image, but there’s a blur over here that suggests there might be something useful.
And then it moves over to there.
And by moving around the image, it can build up a picture of, you know, everything that’s in the digit that it needs to classify.
And we have a similar example here for the digit eight, where it kind of moves around the periphery of the digit in order to classify it.
And so you might ask, you know, why would you bother doing that when you can feed the whole image into the network directly? One issue I mentioned earlier is this idea of scalability, and one way in which a sequential glimpse distribution is more scalable is that you can use it, for example, to represent multiple objects.
This was explored in another paper in 2014. So for example, in the Street View House Numbers dataset there are multiple numbers from people’s street addresses present in each image.
And you want to kind of scan through all of those numbers in order to recognise them in order rather than just looking at the image in a single go, although it can also be applied to more conventional image classification as shown here.
And once again, in order to classify the image, the network will move its attention around the really important parts of the image.
And this gives you an indication, it allows you to see what it is in the image that is necessary in order to make the classification.
So, so far we’ve looked at both implicit and explicit attention, but the explicit attention we’ve looked at has involved making hard decisions about what to look at and what to ignore, and this leads to the need to train the network using RL-like mechanisms.
It makes it impossible to train the whole thing end-to-end with backprop.
So what we’re going to look at in this section is what’s sometimes known as soft or differentiable attention, which gives you explicit attention but makes end-to-end training possible.
So whereas in the previous examples we had these fixed size attention windows that we were kind of explicitly moving around the image, now we’re going to look at something that operates a little bit differently.
And, you know, it’s important to realise that, you know, if we’re thinking about a robot or something where you have to actually direct a camera in order to direct your attention, then in some sense you have to use hard attention because you have to make a decision about whether to look left or right.
But for the kinds of systems we’re mostly focusing on in this lecture, that isn’t really the case.
We’ve got all the data and we just need to make a decision about what to focus on and what not to focus on.
And so we don’t actually need to make a hard decision about attention.
We want to focus more on some regions and less on others in much the same way that I showed that we already implicitly do with a neural network.
But we can take this one step further than implicit attention by defining one of these soft attention, these differentiable attention mechanisms that we can train end-to-end.
And they’re actually pretty simple.
There’s a very basic template.
So if we think back to the glimpse distribution I talked about before, we have the parameters of the network defining some distribution over glimpses.
And what we did then was take a sample from that distribution and it was because we were picking these samples that we needed to think in terms of training the network with reinforcement learning techniques.
So what we can do instead is something like a mean field approach.
We take an expectation over all possible glimpses instead of a sample.
So it’s just this weighted sum where we take all of the glimpse vectors and multiply them by the probability of that glimpse vector given the attention parameters and sum the whole thing up.
And because it’s a weighted sum and not a sample, this whole thing is straightforwardly differentiable with respect to the attention parameters, ‘a’, as long as the glimpse distribution itself is differentiable, which it usually is.
So now we no longer have, you know, REINFORCE or some reinforcement learning algorithm.
We really just have ordinary backprop.
And in actual fact, because we’re doing this weighted sum, we don’t really technically need a probability distribution at all.
All we need is a set of weights here.
So we have a set of weights and we’re multiplying them by an attention, we’re multiplying them by some set of values, which are these glimpses, and the weighted sum of these two things gives us the attention readout.
Now there’s, I’ve got a little asterisk here on the slide where I’m saying: yes, we don’t actually need a proper probability distribution here, but it’s usually a nice thing to have.
So just if we make sure the weights are all between zero and one and that they sum to one then everything tends to stay nicely normalised and sometimes it seems to be a good thing as far as training the network goes.
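The soft attention template described here, a weighted sum of glimpse vectors in place of a sample, can be sketched in a few lines. The glimpse vectors and scores below are made-up values, and `soft_readout` is a name chosen for this example.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_readout(glimpses, weights):
    """Attention readout: sum over i of weights[i] * glimpses[i], per dimension."""
    dim = len(glimpses[0])
    return [sum(w * g[d] for w, g in zip(weights, glimpses)) for d in range(dim)]


glimpses = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three toy glimpse vectors
weights = softmax([2.0, 0.0, -1.0])               # normalised attention weights
readout = soft_readout(glimpses, weights)

# A one-hot weight vector recovers the hard-attention case of picking one glimpse.
hard_readout = soft_readout(glimpses, [0.0, 1.0, 0.0])
```

Because the readout is a smooth function of the weights, and the weights are a smooth function of the scores, gradients flow through the whole thing with ordinary backprop. The softmax is used here only because, as noted above, keeping the weights between zero and one and summing to one tends to be convenient.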
But anyway, if we look at this weighted sum, this attention readout ‘v’, and stop thinking in probabilistic terms and just think of a sum over ‘i’ of some weights ‘wᵢ’ times some vectors, this should look familiar to you. It’s really just an ordinary summation, a sigma unit from an ordinary neural network.
And in fact these weights, ‘wᵢ’, look like network weights.
So we’ve gone from, you know, glimpse probabilities defined by the network to something that looks more like network weights.
And actually we can think of attention in general as defining something like data dependent dynamic weights or fast weights as they’re sometimes known.
And they’re fast because they change dynamically in response to the data so they can change in the middle of processing a sequence, whereas ordinary weights change slowly.
They change gradually over time with gradient descent.
And so to look at these two sort of diagrams I’ve got here on the left, we have the situation with an ordinary ConvNet, where this would be sort of a one dimensional convolutional network where you have a set of weights that are given in different colours here that are used to define a kernel that is mapping into this input that the arrows are pointing into.
But the point is those weights are going to stay fixed.
They’re fixed.
The same kernel is going to be scanned over the same image, or over the same sequence, since in this case it’s one dimensional.
And those weights are only gradually changing over time.
And in addition of course, because it’s a convolution, there’s a fixed size to the kernel.
So we’ve decided in advance how many inputs are going to be fed into this kernel.
With attention we have something more like the situation on the right, so we have this set of weights that first of all extends, can in principle extend, over the whole sequence.
And secondly, critically, those weights are data dependent.
They’re determined by the attention parameters that are emitted by the network, which is itself a function of the inputs received by the network.
So these weights are responding to the input they’ve received.
So they’re giving us this ability to kind of define a network on the fly.
And this is what makes attention so powerful.
So my first experience of soft attention with neural networks was a system I developed some years ago now, I think seven years ago, to do handwriting synthesis with recurrent neural networks.
So handwriting synthesis, unlike the handwriting recognition networks I mentioned earlier, here the task is to take some piece of text like this, the word ‘handwriting’ on the left, and to transform that into something that looks like cursive handwriting.
And basically the way this works is the network outputs, it takes in a text sequence and outputs a sequence, a trajectory of pen positions and these positions define the movement of, or define the actual writing of the letters.
So you can think of this as a kind of sequence to sequence problem but the challenging thing about it is that the alignment between the text and the writing is unknown.
And so I was studying this problem with recurrent neural networks and I found that if I just fed the entire text sequence in as input and then attempted to produce the output, it didn’t work at all.
What I needed was something that was able to attend to a particular part of the input sequence when it was making particular decisions about the output sequence.
So for example, I wanted something that would look at the letter ‘h’ in the input sequence and use that as the conditioning signal for when it was drawing a letter ‘h’ and move on to the letter ‘a’ and so forth.
So once again, I needed something that was able to pick out certain elements of the input sequence and ignore others.
And this was achieved with soft attention.
So basically the solution was that before the network made each, predicted each point in the handwriting trajectory, it decided where to look in the text sequence using a soft attention mechanism.
And so the mechanism here, which is a little bit different from the normal attention mechanisms that you see in neural networks that we’ll talk about later, the mechanism here was, the network explicitly decided how far along to slide a Gaussian window it had over the text sequence.
So there was a kind of, I thought of it as a soft reading network.
And so the parameters emitted by the network determined a set of Gaussian functions, shown here by these coloured curves. Those functions had a particular centre, which determined where they were focused on the input sequence, and the network was also able to parameterise the width of the Gaussian, so it kind of determined how many of the letters in the input sequence it was looking at.
And I should say, the sequence of input vectors here is what I’ve shown as a series of one-hot vectors, which is how they’re presented to the network, but what these actually correspond to is letters.
So you can think of this as an ‘h’ here and an ‘a’ here and so forth, and then what the network is deciding is where to put these Gaussians, which implicitly means, once we perform this summation at the top here that gives us the attention weights, what part of the text sequence we should look at in order to generate the output distribution.
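A Gaussian window over a text sequence might look like the sketch below. The exact parameterisation is an assumption for illustration: each Gaussian has a hypothetical height `alpha`, inverse width `beta` and centre `kappa`, and the centre is advanced by a positive step `exp(offset)` so that the window only slides forward. The toy alphabet and text are made up.

```python
import math


def window_weights(alphas, betas, kappas, length):
    """Weight on text position u: sum over Gaussians of alpha * exp(-beta * (kappa - u)**2)."""
    return [sum(a * math.exp(-b * (k - u) ** 2)
                for a, b, k in zip(alphas, betas, kappas))
            for u in range(length)]


def read_window(one_hots, weights):
    """Soft read: weighted sum of the one-hot character vectors."""
    dim = len(one_hots[0])
    return [sum(w * v[d] for w, v in zip(weights, one_hots)) for d in range(dim)]


def slide(kappas, offsets):
    """Advance each centre by exp(offset), which is always positive."""
    return [k + math.exp(o) for k, o in zip(kappas, offsets)]


def one_hot(char, alphabet):
    return [1.0 if c == char else 0.0 for c in alphabet]


alphabet = "adhn"
text = "hand"
one_hots = [one_hot(ch, alphabet) for ch in text]

kappas = [1.0]                                    # one Gaussian centred on position 1, the 'a'
weights = window_weights([1.0], [4.0], kappas, len(text))
window = read_window(one_hots, weights)

kappas = slide(kappas, [0.0])                     # exp(0) = 1, so the centre moves to position 2
slid_weights = window_weights([1.0], [4.0], kappas, len(text))
```

A larger `beta` gives a narrower window that looks at fewer letters; a smaller one blends neighbouring letters together.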
And so doing this, the network was able to produce remarkably realistic looking handwriting.
These are all generated samples and you can see that it also generates, as well as being able to legibly write particular text sequences, it writes in different styles.
And the reason it does this, of course, is that it’s trained on a database of handwriting from people writing in different styles.
And so it kind of learns that in order to generate realistic sequences, it has to pick a particular style and stick with it.
So I’m claiming on this slide that real people write this badly.
Maybe that’s not quite strictly true, but you can, you know, you can see at least that here was a system where attention was allowing the network to pick out the salient information and using that to generate something quite realistic.
And so, as I said, one advantage of this use of attention is that it gives you this interpretability, it allows you to look into the network and say, what were you attending to when you made a particular decision? And so this heat map here shows that while the network was writing the letters shown along the bottom: the handwriting here is the horizontal axis and the vertical axis is the text itself.
And you can see what this heat map shows is what part of the text was the network really focusing on when it was producing a particular, when it was predicting a particular part of the pen trajectory.
And you can see that there’s this roughly diagonal line because of course, you know, there is here a one, really a one-to-one correspondence between the text and the letters that it writes.
But this line isn’t perfectly straight.
So the point is that some letters might have 25 or 30 points in them, or even more, while other letters might have far fewer.
And so this is, the whole issue of the alignment being unknown that attention was able to solve in this instance.
And so this is an example, an early example of what’s now kind of thought of as location-based attention.
So the attention is really about just where, how far along the input sequence you should look.
And so what’s kind of interesting here is to see what happens if you take that attention mechanism away and just allow the network to generate handwriting unconditionally.
And this was very similar to the result I obtained when I first tried to treat this task as a more conventional sequence to sequence learning problem where the entire text sequence was fed to the network at once.
And what happens is it generates things that kind of look like words that kind of look like letters but don’t make much sense.
And of course it’s obvious the reason for this is that the conditioning signal isn’t reaching the network because it doesn’t have this attention mechanism that allows it to pick out which letter it should write at a particular time.
Okay, so, that was sort of an early example of a neural network with soft attention.
But the form of attention that’s really kind of taken over, the one that you’ll see everywhere in neural networks now, is what I think of as associative or content-based attention.
So instead of choosing where to look according to the position within a sequence of some piece of information, what you can do instead is attend to the content that you want to look at.
And so the way this works is that the attention parameter emitted by the network is a key vector.
And that key vector is then compared to all the elements in the input data using some similarity function.
So typically you have something like cosine similarity or something that involves taking a dot product between the key and all the elements in the data.
And then typically this is then normalised with something like a softmax function and that gives you the attention weights.
So, you know, implicitly what you’re doing is you’re outputting some key, you’re looking through everything in the data to see which parts of the data most closely match that key, and you’re getting back an attention vector that focuses more strongly on the places that correspond more closely to the key.
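The key-compare-normalise recipe can be sketched as follows. Cosine similarity followed by a softmax is one of the options mentioned above; the `sharpness` multiplier is an extra assumption added for this example to make the softmax more or less peaked, and the data vectors are made up.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def content_weights(key, data, sharpness=1.0):
    """Attention weights: softmax over the similarity between the key and each data element."""
    scores = [sharpness * cosine_similarity(key, item) for item in data]
    return softmax(scores)


data = [[1.0, 0.0, 0.0],    # matches the key exactly
        [0.0, 1.0, 0.0],    # orthogonal to the key
        [0.9, 0.1, 0.0]]    # close to the key, but not identical
key = [1.0, 0.0, 0.0]
weights = content_weights(key, data, sharpness=10.0)
```

The weights are largest on the element that matches the key, somewhat smaller on the near match, and smallest on the orthogonal element.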
And this is a really natural way to search.
You can essentially do everything you need to do computationally just by using a content-based lookup.
And what’s really interesting about it is that especially with this sort of cosine similarity measure, it gives you this multidimensional feature-based lookup.
So you can put a set of features corresponding to particular elements of this key vector and find something that matches along those features and ignore other parts of the vectors.
So just by setting other parts of the vector to zero, you’ll get something that matches on particular features and doesn’t worry about others.
So it has, it gives this multidimensional, very natural way of searching.
So for example, you might want to say, well, show me an earlier frame of video in which something red appeared.
And you can do that by specifying the kind of red element in your, the representation of your key vector.
And then the associative attention mechanism will pick out the red things.
So typically what’s done now is, given these weights, you can then perform this expectation that I mentioned earlier where you sum up over the data, you compute this weighted sum and you get an attention readout.
What you can also do, and this has become I think increasingly popular with attention-based networks, is you can split the data into key value pairs and use the keys to define the attention weights and the values to define the readout.
So there’s now a separation between what you use to look up the data and what you’re actually going to get back when you read it out.
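Splitting the data into key-value pairs might be sketched like this. Dot-product similarity and the softmax are assumptions chosen for the example, as are the names `lookup_key`, `data_keys` and `data_values` and the toy numbers.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def key_value_attention(lookup_key, data_keys, data_values):
    """Keys decide WHERE to look; values decide WHAT comes back."""
    scores = [sum(q * k for q, k in zip(lookup_key, key)) for key in data_keys]
    weights = softmax(scores)
    dim = len(data_values[0])
    readout = [sum(w * v[d] for w, v in zip(weights, data_values)) for d in range(dim)]
    return weights, readout


data_keys = [[1.0, 0.0], [0.0, 1.0]]
data_values = [[10.0, 0.0, 0.0], [0.0, 20.0, 0.0]]   # values need not share the key dimension
weights, readout = key_value_attention([5.0, 0.0], data_keys, data_values)
```

Here the lookup key matches the first data key, so the readout is dominated by the first value vector even though that value looks nothing like the key.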
And this has been used, I mean, as I said, this is now really a fundamental building block of deep learning.
And it was first applied in this paper from 2014 for neural machine translation.
And once again, so similar to the heat map I showed you in a previous slide for implicit attention, we have something here that shows what the network is attending to when it translates in this case from, I believe it’s translating from English to French or it might be from French to English, and what’s kind of interesting here you can see, first of all, if we compare this to the earlier heat map I showed for implicit attention, it’s clear that the decisions are much sharper, so you get a much stronger sense here of exactly what the network is attending to and what it’s ignoring.
Secondly, in this case, there’s a more or less, one-to-one correspondence between the English words and the French words, apart from this phrase, ‘European Economic Area’, that is reversed in French.
And you can see this reversal here in the image by this kind of, line that goes sort of against the diagonal of the rest of the sequence.
And so this is a very, as we’ll see, this is a very powerful general way of allowing the network in a differentiable end-to-end trainable way allowing the network to pick out particular elements of the input data.
Here’s an example of a similar network in use.
Here, the task is to determine what this removed symbol is in the data.
So if we look at the example on the left, we have, I should say the proper names have been replaced by numbered entities here, which is quite a standard thing to do in language processing tasks because proper names are very difficult to deal with otherwise.
And we have this task where ‘entity 119 identifies deceased sailor as X’ and what the network has to do is to fill in X, and you can see from this heat map here which words it’s attending to when it attempts to fill in this X.
And you can see it’s particularly focused on this entity 23, which was presumably the decision it made and which is indeed correct.
It says he was identified Thursday as special warfare operator entity 23.
In general it’s focusing on the entities throughout because it can kind of tell that those are the ones that it needs to look at in order to answer these questions.
Similarly, X dedicated their fall fashion show to moms.
You can see it’s very focused on this particular entity here that’s helping it make this decision.
And what’s really crucial here is that there’s a lot of text in this piece, there’s a lot of text that it’s ignoring, but it’s using this content-based attention mechanism to pick out specific elements.
And this attention mechanism, this combination typically of a recurrent neural network with attention, can be used much more broadly.
It’s been applied to speech recognition for example.
And here we see a plot, not dissimilar to the one I showed you for handwriting synthesis where we have an alignment being discovered between the audio data here presented as a spectrogram and the text sequence that the network is outputting, the characters that it’s using to transcribe this data.
And so for example, there was this long pause at the start when nothing happens, the network mostly ignores that.
It knows that when it has to start emitting, for example, the ‘st’ at the start of the sentence, it’s very focused on these sounds at the beginning corresponding to those noises in the speech signal.
So basically this attention mechanism is a very general purpose technique for focusing in on particular parts of the data.
And this is all done with, well mostly all done with content-based attention.
Okay, so there are a huge number of possible attention mechanisms and we’re only going to mention a few of them in this talk.
And one idea I want to leave you with is that there’s a very general framework here.
Having defined this attention template that gives you this weighted sum, there’s lots of different operators you could use to get those attention weights.
And one very interesting idea from a network known as DRAW from 2015, the idea was to determine an explicitly visual kind of soft attention.
So this is kind of similar to the foveal models we looked at earlier only instead of an explicit kind of hard decision about where to move this fovea around the image, rather there was a set of Gaussian filters that were applied to the image.
And what these did is they have a similar effect of being able to focus in on particular parts of the image and ignore other parts but it’s all differentiable end-to-end because there’s a filter that is being applied everywhere that gives you these attention weights.
And what does this filter look like? Well if you look at these three images on the right, we show that for different settings of the parameters for the Gaussian filters (the filter variance, which is essentially the width of the filter, the centre, the stride, as we’ve shown here, with which the filter is applied throughout the image, and also this last parameter for intensity), by varying these we get different views of this same digit five.
So this one here is quite focused in on this central part of the image.
This one here is looking more at the image as a whole and it’s doing so with quite low variance.
So it’s getting quite a sharp picture of this image.
This one on the bottom here is getting a more kind of blurred, like less distinct view of the entire image.
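To make the role of the centre, stride, variance and intensity parameters concrete, here is a one-dimensional sketch of a Gaussian filterbank. It is a simplification under stated assumptions: the real model applies such filters along both image axes, and the formula used here for placing the filter means (evenly spaced around the centre, one stride apart) is recalled from the DRAW paper rather than taken from the lecture.

```python
import math


def filterbank(centre, stride, variance, num_filters, length):
    """Each row is a normalised Gaussian filter over `length` pixel positions."""
    filters = []
    for i in range(1, num_filters + 1):
        # Filter means are spaced `stride` apart and centred on `centre`.
        mean = centre + (i - num_filters / 2 - 0.5) * stride
        row = [math.exp(-((x - mean) ** 2) / (2 * variance)) for x in range(length)]
        total = sum(row) or 1.0   # guard against an all-zero row
        filters.append([v / total for v in row])
    return filters


def read_1d(signal, filters, intensity=1.0):
    """Soft glimpse: one weighted sum of the signal per filter, scaled by intensity."""
    return [intensity * sum(f * s for f, s in zip(row, signal)) for row in filters]


signal = [0.0] * 4 + [1.0] * 4 + [0.0] * 4        # a bright bar at positions 4..7

# Small stride and low variance: a sharp view zoomed in on the bar.
narrow_filters = filterbank(centre=5.5, stride=1.0, variance=0.1, num_filters=4, length=12)
narrow = read_1d(signal, narrow_filters)

# Large stride and high variance: a blurred view of the whole signal.
wide_filters = filterbank(centre=5.5, stride=3.0, variance=4.0, num_filters=4, length=12)
wide = read_1d(signal, wide_filters)
```

Every filter touches every pixel with some weight, which is why the whole read is differentiable with respect to the centre, stride, variance and intensity.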
And so we can see a video of a DRAW network in action.
What we’re seeing here, the movement of these green boxes shows the attention of the network.
I’m just going to play that again, it’s rather quick.
The attention of the network, as it looks at an MNIST digit and you can see that it starts off kind of attending to the whole image and then very quickly zooms in on the digit and moves the box around the digit in order to read it.
And it does a similar thing when it starts to generate data.
So this red box shows where its attention is as it’s generating the data.
And once again, it starts off kind of generating this kind of blurred out view of the whole image and then focuses down on a specific area and kind of, it does something that looks a lot like, it’s actually drawing the image, it’s actually using the attention mechanism to trace out the strokes of the digit.
And so again, what’s nice about this is we have something that’s transforming a kind of static task into a sequential one, where there’s this sequence of glimpses or views of the data.
And what’s nice about that is that we get this generalisation: because it generates these images kind of one part at a time, it can be extended to something that generates multiple digits, for example, within the same image.
And this is an illustration of this general property of, I think, scalability that I referred to for attention mechanisms.
So far what we’ve talked about is attention applied to the input data being fed to the network.
But as I mentioned at the start of the lecture, there’s another kind of attention which I think of as introspective or kind of inward attention where we as people use our kind of, use a kind of cognitive attention to pick out certain thoughts or memories and in this section I’m going to discuss how this kind of attention can be introduced to neural networks.
So as I’ve said, in the previous slides, what we were looking at was attention to external data.
So deciding where in a text sequence to look, which part of an image to look at and so forth.
But if we sort of apply this attention mechanism to the network’s internal state or memory then we have this notion of introspective attention.
And as I’ve said, the way I like to think about it is that memory is attention through time.
It’s a way of picking out a particular event that may have happened at some point in time and ignoring others.
And once again, just want to come back to this idea that attention is all about ignoring things, it’s all about what you don’t look at.
And so there’s an important difference here between internal information and external information, which is that we can actually modify the internal information so we can do selective writing as well as reading, allowing the network to use attention to iteratively modify its internal state.
And an architecture that I and colleagues at DeepMind developed in 2014 did exactly this.
We called it a Neural Turing Machine because what we wanted was something that resembled the action of a Turing machine, its ability to read from and write to a tape, using a neural network acting through a set of attention mechanisms.
And I’m going to talk about this architecture in some detail because it gives you a nice insight into the variety of things that can be achieved with attention mechanisms.
And it really shows this link between attention and memory.
So the controller in this case is a neural network.
It can be recurrent or it can be feed forward.
Once again, even if it’s feed forward, the combined system is recurrent because there’s this loop through the attention mechanisms.
And then we referred to the attention modules that are parameterised by the network as ‘heads’.
And so this was in keeping with the Turing machine analogy, the tape analogy.
But this is something that I think has been picked up in general.
People often talk about attention heads.
And, you know, these heads are attention mechanisms, are soft attention mechanisms in the same kind of template that we’ve discussed before.
And their purpose is to select portions of the memory.
The memory is just this real-valued matrix.
It’s just the big grid of numbers that the network has access to.
And the key thing that’s different is that, as well as being able to select portions of the memory to read from, these heads can also selectively write to the memory.
So yeah, once again, this is all about selective attention.
We don’t want to modify the whole memory in one go.
I should stress here that a key part of the design decision underlying the Neural Turing Machine was to separate out computation from memory, in the same way as is done in a normal digital computer. For a normal recurrent neural network, for example, in order to give the system more memory you have to make the hidden state larger, which increases the amount of computation done by the network as well as giving it more memory.
So computation and memory are kind of inherently bound up in an ordinary network and we wanted to separate them out.
We wanted potentially quite a small controller that can have access to a very large memory matrix.
In the same way that a small processor in a digital computer can have access to, you know, a large amount of RAM or disc or other forms of memory.
And so it’s key, you know, if you look at it from that perspective, it’s key that it’s not processing the entire memory at once.
If this thing’s going to be large, it needs to selectively focus on parts of it to read and write.
And so we do this using the same template as I mentioned before for soft attention: the controller, the neural network, outputs parameters that parameterise what we’re calling a distribution or a weighting over the rows in the memory matrix.
But this weighting is really just the same attention weights that we discussed before.
And we have two main attention mechanisms.
So I’ve mentioned in the previous section that my first experience of soft attention in neural networks was around location-based attention as was applied for this handwriting synthesis network, which was in fact the inspiration for the Neural Turing Machine.
So having realised that the handwriting synthesis network could selectively read from an input sequence, I started to think: well, what would happen if it could write to that sequence as well? And wouldn’t it then start to resemble a Turing Machine? But as well as the location-based attention that was considered in the handwriting synthesis network, this also incorporates content-based attention, which as I’ve said, is the preeminent form of attention as used in neural networks.
So addressing by content looks a lot like it does with other content-based networks.
There’s a key vector emitted by the controller that is compared to the content of each memory location.
So take each row in memory and treat that as a vector.
And then we compare the key to that vector using a similarity measure, which was indeed cosine similarity, which we then normalise with a softmax.
We also introduced an extra parameter which isn’t usually there for content-based attention, which we called ‘sharpness’, and this was used to sort of selectively narrow the focus of attention so that it could really focus down on individual rows in the memory.
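As a minimal sketch of the content-based addressing just described, the key is compared with every memory row by cosine similarity, the similarities are scaled by the sharpness parameter, and a softmax turns them into attention weights. The function name `content_weights`, the small `eps` guard and the example values are assumptions made for illustration, not code from the original work.

```python
import numpy as np

def content_weights(memory, key, sharpness):
    """Soft attention over memory rows by cosine similarity to a key."""
    eps = 1e-8  # assumption: guards against division by zero for all-zero rows
    norms = np.linalg.norm(memory, axis=1) * np.linalg.norm(key)
    similarity = memory @ key / (norms + eps)
    scores = sharpness * similarity   # larger sharpness narrows the focus
    scores = scores - scores.max()    # subtract the max for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()    # softmax normalisation

# Illustrative memory with three rows; the key matches row 0 exactly.
memory = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.7, 0.7, 0.0]])
key = np.array([1.0, 0.0, 0.0])

soft = content_weights(memory, key, sharpness=1.0)
sharp = content_weights(memory, key, sharpness=50.0)
```

With a low sharpness the weighting is spread across all three rows; with a high sharpness almost all of the weight lands on the best-matching row.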
But we also included this notion of addressing by location.
And the way this worked was that the network looked at the previous weighting and output a shift kernel which was just a, sort of, a softmax of numbers between plus and minus ‘n’.
And we then essentially convolve that with the weighting from the previous timestep to produce a shifted weighting.
So basically the maths here, very simple, are shown below.
And what this did is it just essentially shifted attention through the memory matrix.
It shifted it down.
So if you started here and output a shift kernel focused around maybe five steps or so, then you’d end up with an attention distribution that would look like this.
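The shift operation can be sketched as a circular convolution of the previous weighting with the shift kernel. In this sketch the kernel is assumed to be indexed by offsets from -n to +n, and a positive offset is assumed to move the focus towards higher row indices; the function name and example values are illustrative only.

```python
import numpy as np

def shift_weights(w_prev, shift_kernel):
    """Circularly convolve the previous weighting with a kernel over offsets -n..+n."""
    n = (len(shift_kernel) - 1) // 2
    w_new = np.zeros_like(w_prev)
    for offset, s in zip(range(-n, n + 1), shift_kernel):
        # np.roll moves the whole weighting by `offset` rows, wrapping at the ends
        w_new += s * np.roll(w_prev, offset)
    return w_new

# Focus starts on row 1; the kernel puts most of its mass on a shift of +1.
w_prev = np.array([0.0, 1.0, 0.0, 0.0, 0.0, 0.0])
kernel = np.array([0.0, 0.1, 0.9])   # offsets -1, 0, +1
w_shifted = shift_weights(w_prev, kernel)
```

Here most of the attention moves one row down, with a little left on the original row because the kernel is not perfectly sharp.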
And so this combination of addressing mechanisms, the idea behind this was to allow the controller to have different modes of interacting with the memory.
And we thought about these modes as corresponding to data structures and accessors as are used in sort of conventional programming languages.
So as long as the content is being used on its own, then memory is kind of being accessed the way it would be in something like a dictionary or an associative map.
Well not strictly like a dictionary because we didn’t have key value attention for this network although you could define it.
Rather it would be more like an associative array.
Through a combination of content and location, what we could do is use the content-based key to locate something like an array of contiguous vectors in memory and then use the location to shift into that array, to shift to index a certain distance into the array.
And when the network only used the location-based attention, essentially it acted like an iterator, it just moved on from the last focus.
So it could essentially read a sequence of inputs in order.
So, as we’ve said, this network uses attention to both read from and write to the memory. Reading is very much the standard soft attention template.
We get a set of weights over the rows in the memory matrix to which the network is attending.
And we compute this weighted sum.
So we take each row in the matrix and multiply it by the weight, which gives the degree to which the network is attending to that particular row.
So this is just very much, this is exactly like the soft attention template I described before only it’s being applied to this memory matrix rather than being applied to some external piece of data.
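A read under this template is just a weighted sum of memory rows. The following is a small sketch with made-up numbers; `read` is a hypothetical helper name.

```python
import numpy as np

def read(memory, weights):
    """Return the attention-weighted sum of the memory rows."""
    return weights @ memory

memory = np.array([[1.0, 2.0],
                   [3.0, 4.0],
                   [5.0, 6.0]])
weights = np.array([0.9, 0.1, 0.0])   # mostly attending to row 0
read_vector = read(memory, weights)
```

The result is dominated by row 0, with a small contribution from row 1 and nothing from row 2.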
The part that was kind of novel and unusual was the write head, the writing attention mechanism used by Neural Turing Machines.
And so in this case, inspired by the way long short-term memory (LSTM) has forget and input gates that are able to modify the contents of its own internal state, we defined a combined operation of an erase vector, ‘e’, which behaves analogously to the forget gate in long short-term memory, and an add vector, which behaves like the input gate.
And essentially what happened is that once the write head had determined which rows in the matrix it was attending to, the contents of those rows were then selectively erased according to ‘e’; and I should say here: so ‘e’ is basically a set of numbers between zero and one.
So basically if some part of the erase vector goes to one, that means whatever was in the memory matrix at that point is now wiped, it’s set to zero.
And if ‘e’ is set to zero, then the memory matrix is left as it is.
So once again, there’s this kind of smooth or differentiable analogue of what is essentially a discrete behaviour, the decision of whether or not to erase.
And adding is more straightforward, it just says: well, take whatever’s in memory and add whatever’s in this add vector, ‘a’, multiplied by the write weights.
So basically if the write weight is high and you’re strongly attending to a particular row in the matrix, then you are going to essentially add whatever is in this add vector to that row.
And the important thing here is that for all the rows in the matrix for which this ‘w[i]’ is very low, nothing happens, nothing changes. So if you’re not attending to that part of the memory, you’re not modifying it either.
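The erase-then-add write can be sketched as follows: each row is first scaled by one minus its write weight times the erase vector, and then the add vector, scaled by the same write weight, is added. The helper name `write` and the numbers are assumptions for illustration.

```python
import numpy as np

def write(memory, weights, erase, add):
    """Selective write: erase where attended, then add where attended."""
    w = weights[:, None]                          # one weight per row
    erased = memory * (1.0 - w * erase[None, :])  # erase entries of 1 wipe attended rows
    return erased + w * add[None, :]              # rows with zero weight are untouched

memory = np.ones((3, 2))
weights = np.array([1.0, 0.0, 0.0])   # fully attending to row 0
erase = np.array([1.0, 0.0])          # wipe the first column, keep the second
add = np.array([5.0, 0.0])
new_memory = write(memory, weights, erase, add)
```

Only row 0 changes: its first entry is wiped and replaced by 5, its second entry is kept, and the unattended rows are left exactly as they were.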
So how does this work in practice? What we really looked at was: can this Neural Turing Machine learn some kind of primitive algorithms, in the sense that we think of algorithms as applied on a normal computer? And in particular, does having the separation between processing and memory enable it to learn something more algorithmic than we could do, for example, with a recurrent neural network? And we found that it was indeed able to learn some very simple algorithms.
So the simplest thing we looked at was a copy task.
So basically a series of random binary vectors were fed to the network, that’s shown here, at the start of the sequence, and then the network just has to copy all of those and output them through the output vectors.
And all it has to do is just exactly copy what was in here to what’s over here.
So it’s an entirely trivial algorithm.
It’s not interesting in its own right as an algorithm.
But what’s surprising about it is that it’s difficult for an ordinary neural network to do this.
So neural networks generally are very good at pattern recognition.
They’re not very good at exactly sort of remembering, storing and recalling things.
And that was exactly what we hoped to add by including this access to this memory matrix.
And so the algorithm that it uses, well, a kind of pseudocode version for it here is given on the right but we can also analyse it by looking at the use of attention and the way it attends to particular places in the memory during this task.
And so these two heat maps that are shown here at the bottom are again heat maps showing the degree to which the network is attending to that particular part of the memory.
So when it’s black, it’s being ignored, when it’s white, it’s focusing, and you can see that there’s a very sharp focus here, which is what we want because it’s basically implementing something that is really fundamentally a discrete algorithm.
And so what it does in order to complete this copy task is it picks a location in memory, given here, and then starts to write whatever input vector comes in, essentially just copies that to a row of memory and then uses the location-based iterator, this location-based attention, to just move on one step to the next row of memory and then it copies the next input and so forth until it’s finished copying them all.
And then when it has to output, it uses its content-based lookup to locate the very start of the sequence and then just iterates through until it’s copied out everything remaining.
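The copy procedure described here can be imitated by hand with idealised, perfectly sharp attention. This is only a sketch of the behaviour, not of the learned network: the weights are hard one-hot vectors, the jump back to the start is done with a remembered index rather than a content-based lookup, and names such as `copy_with_memory`, `num_rows` and `start` are assumptions.

```python
import numpy as np

def copy_with_memory(inputs, num_rows=16, start=3):
    """Write each input to successive memory rows, then read them back in order."""
    memory = np.zeros((num_rows, inputs.shape[1]))
    w = np.zeros(num_rows)
    w[start] = 1.0                    # pick a starting location in memory
    for x in inputs:
        memory += np.outer(w, x)      # write the input to the attended row
        w = np.roll(w, 1)             # location-based shift to the next row
    w = np.zeros(num_rows)
    w[start] = 1.0                    # jump back to the start of the sequence
    outputs = []
    for _ in range(len(inputs)):
        outputs.append(w @ memory)    # read the attended row
        w = np.roll(w, 1)             # move on one step
    return np.array(outputs)

rng = np.random.default_rng(0)
sequence = rng.integers(0, 2, size=(10, 4)).astype(float)  # random binary vectors
copied = copy_with_memory(sequence)
```

Because the procedure never depends on the sequence length, it works for any length that fits in memory, which is the kind of generalisation the lecture attributes to the learned solution.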
And so, you know, just once again, what was really interesting here was to be able to get a kind of, an algorithmic structure like this going through a neural network that, you know, it was completely parameterised by neural network and it was completely learned end-to-end and there was nothing sort of built into the network to adapt it towards this sort of algorithmic behaviour.
And actually a normal recurrent neural network, an LSTM network for example, can perform this task.
You feed in a sequence of inputs and ask it to reproduce them as outputs just as a sequence to sequence learning problem.
But what you find is it will work reasonably well up to a certain distance, but it won’t generalise beyond that.
So if you train it to copy sequences up to length 10, and then ask it to generalise to sequences up to length, you know, a hundred, you’ll find it doesn’t work very well, as we’ll see.
Whereas with the Neural Turing Machine, we found that it did work quite well.
So in these heat maps here, we’re showing the targets and the outputs.
So basically this is the copy sequence given to the network, if it’s doing everything right, each block at the top exactly matches each block at the bottom.
And you can see that it’s not perfect, like there’s some mistakes creeping in as the sequences get longer.
So this is for sequences, you know, short sequences like 10, 20, 40, so on.
But you can still see that most of the sequence is still kind of retained.
Like most of the targets are still being matched by the outputs of the network.
And that’s because it’s just basically performing this algorithm and using that to generalise to longer sequences.
So this is an example of where attention and being able to selectively pick out certain parts of information and ignore others gives you a stronger form of generalisation.
And so this form of, this kind of generalisation that we see with Neural Turing Machines does not happen with, you know, a normal LSTM model, for example, essentially it learns to copy up to 10 and then after 10, it just goes completely awry.
It starts to output sort of random mush.
And this really shows that it hasn’t learned an algorithm, it’s rather kind of hard-coded itself, it’s learned internally to store these 10 things in some particular place in its memory and it doesn’t know what to do when it goes beyond that.
So in other words, because it lacks this attention mechanism between the network and the memory, it’s not able to kind of separate out computation from memory, which is what’s necessary to have this kind of generalisation.
And so this can be extended.
We looked at other tasks; one very simple one was to learn something akin to a for-loop.
So the network is given a random sequence once, and it’s then also given an indicator telling it how many times it should reproduce that sequence as output.
And then it just has to output the whole sequence N times, to copy N times.
And so basically what it does is just uses the same algorithm as before except now it has to keep track of how many times it’s output the whole sequence.
So it just keeps on jumping to the start of the array, to a particular row in memory, using content-based addressing, then iterates one step at a time, gets to the end and jumps back.
And meanwhile it has this sort of internal variable that keeps track of the number of steps that it’s done so far.
Another example of, you know, what it could do with memory was this N-Gram inference task.
So here the task is that a sequence is generated using some random set of N-Gram transition probabilities.
So basically saying, given sort of some binary sequence, given the last three or four inputs, there’s a set of probabilities telling you, you know, whether the next input will be a zero or a one.
And those probabilities are sort of randomly determined and then you generate a sequence from them.
And what you can do as the sequence goes on, you can infer what the probabilities are.
And there’s a, you know, a sort of, a Bayesian algorithm for doing this optimally.
But what was interesting was: well, how well does a neural network manage to do this? It’s kind of like a meta-learning problem where it has to look at the first part of the sequence, work out what the probabilities are, and then start making predictions in accordance with those probabilities.
And what we found was that yes, once again, LSTM can kind of do this, but it makes quite a lot of mistakes.
The red arrows here, I think, actually indicate mistakes made by the Neural Turing Machine.
But in general, the Neural Turing Machine was able to sort of perform this task much better.
And the reason it was able to do that is that it used its memory, it used specific memory locations to store variables that kept count of the occurrences of particular N-Grams.
So if it had seen zero zero one for example, it would use that to define a key to look up a place in memory, it might be this place here, and then the next time it saw zero zero one it would be able to increment that, which is basically a way of saying, okay, if I’ve learned that zero zero one is a common transition, that means that the probability of two zeros followed by one must be really high.
So basically it learns to count these occurrences which is exactly what the optimal Bayesian algorithm does.
And it is able to do this by picking out specific areas in its memory and using those as counters.
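The counting strategy can be sketched with an ordinary dictionary standing in for the memory locations. For each context (the last few bits) the sketch keeps counts of how often a zero or a one followed, and predicts the probability of a one as (N1 + prior) / (N0 + N1 + 2 * prior). The prior of 0.5 is an assumption (it corresponds to a Beta(1/2, 1/2) prior), as are the function name and the example sequence.

```python
from collections import defaultdict

def predict_online(bits, context_len=2, prior=0.5):
    """Predict P(next bit = 1) from running counts per context, updating as we go."""
    counts = defaultdict(lambda: [0, 0])   # context -> [times 0 followed, times 1 followed]
    predictions = []
    for t in range(context_len, len(bits)):
        context = tuple(bits[t - context_len:t])
        n0, n1 = counts[context]
        predictions.append((n1 + prior) / (n0 + n1 + 2 * prior))
        counts[context][bits[t]] += 1      # increment the counter for what actually came next
    return predictions, counts

bits = [0, 0, 1, 0, 0, 1, 0, 0, 1]
predictions, counts = predict_online(bits)
```

Each time the context zero zero is followed by a one, the counter for that context is incremented, so the predicted probability of a one after zero zero rises from 0.5 towards 1 as the sequence goes on.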
So here’s a little video kind of showing the system in action.
So this is performing the repeat-N-times task. At the start here, where the network is going quickly, we see what happens during training.
Then everything slows down here and we have what happens with a trained version of the network: the input data comes in.
While the input data was coming in, we saw this blue arrow here, which showed the input data being written to the network memory one step at a time.
So it stored this input sequence in memory and then the writing task begins.
Once the writing task begins, we see this red arrow which represents the write weights, the attention parameters used for writing.
And we can see that these are, you know, very tightly focused on one particular row in the network, the one that it’s emitting as output at any one point in time and it’s then iterating through this array one step at a time.
But what you can also see as this video goes on is that the size and colours of the circles here represent the magnitude of the variables within this memory matrix.
I believe the hot colours are positive and the cold colours are negative, as I remember.
And the network is running through this loop several times.
Just run that video again and we can see during training that, you know, at first these attention, these read and write weights are not sharply focused.
They’re blurred out.
So this sharp focus kind of comes later on.
Once the network has finished writing the whole sequence, it then sort of, you see these variables in the background become larger.
That’s because it’s using those to keep count of how many times it’s repeated this copy operation.
And then at the end it changes this final, this row at the bottom, which is an indicator to the network that the task is complete.
So it’s using this memory to kind of perform an algorithm here.
And so quickly, I’m just going to mention that, following the Neural Turing Machine, we introduced an extended version of it, a successor architecture, which we called Differentiable Neural Computers, and we introduced a bunch of new attention mechanisms to provide memory access.
And I’m not going to go through that in detail, but just to say: rather than looking at algorithms with this updated version of the architecture, what we were really interested in was looking at graphs, because while recurrent neural networks are designed for sequences in particular, many types of data are more naturally expressed as a graph of nodes and links between nodes.
And because of this ability to store information in memory and to store and recall with kind of something akin to random access, it’s possible for the network to store quite a large and complex graph in memory and then perform operations on it.
And so what we did during training of the system is we looked at randomly connected graphs and then when we tested it, we looked at specific examples of graphs.
So one of them was a graph representing the, sort of, zone one of the London underground.
And we were able to ask questions like, well, can you find the shortest path between Moorgate and Piccadilly Circus? Or can you perform a traversal when you start at Oxford Circus and follow the Central line and the Circle line and so forth.
And it was able to do this because it could store the graph in memory and then selectively kind of recall elements of the graph.
And similarly, we asked it some questions about a family tree where it had to determine complex relationships like Maternal Great Uncle.
In order to do that, it had to keep track of all the links in the graph.
So for the remainder of the lecture, we’re going to look at some further topics in attention and deep learning.
So one type of deep network that’s got a lot of attention recently is the Transformer network.
And what’s really interesting about Transformers, as they’re often known, from the point of view of this lecture is that they really take attention to the logical extreme.
They basically get rid of all of the other components that could have been present in similar deep networks, such as the recurrent state present in recurrent neural networks, convolutions, external memory like we discussed in the previous section, and they just use attention to repeatedly transform a data sequence.
So the paper that introduced Transformers was called ‘Attention Is All You Need’, and that’s really the fundamental idea behind them: that this attention mechanism is so powerful it can essentially replace everything else in a deep network.
And so the form of attention used by Transformers is mathematically the same as the attention we’ve looked at before, but the way it’s implemented in the network is a little bit different.
So instead of there being a controller network as there was in the Neural Turing Machine, for example, that emits some set of attention parameters that are treated as a query, rather, what you have is that every vector in the sequence emits its own query and compares itself with every other.
And sometimes I think of this as a sort of emergent or anarchist attention where the attention is not being dictated by some central control mechanism, but it’s rather arising directly from the data.
And so in practice what this means is you have quite a similar attention mechanism to the content-based attention we’ve discussed previously, where there’s a cosine similarity calculated between a set of vectors.
But the point is that there’s a separate key being emitted for every vector in the sequence that’s compared with every other.
And as with NTM and DNC, there are multiple attention heads that are used, so in fact every point in the input sequence gives not just one attention key to be compared with the rest of the sequence but several.
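A minimal sketch of this all-to-all attention follows, with every position emitting its own query and each head using its own projections. Note one deliberate difference from the description above: the sketch uses the scaled dot product from the Transformer paper rather than cosine similarity. The random weights, dimensions and names (`self_attention`, `w_q`, `w_k`, `w_v`) are illustrative assumptions, and the sketch omits the rest of the architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Every position emits a query and is compared with every position's key."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])   # scaled dot-product similarity
    weights = softmax(scores, axis=-1)        # one attention distribution per position
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_head, num_heads = 5, 8, 4, 2
x = rng.normal(size=(seq_len, d_model))       # one embedding per sequence element

heads = []
for _ in range(num_heads):                    # each head has its own projections
    w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    out, attn = self_attention(x, w_q, w_k, w_v)
    heads.append(out)
multi_head = np.concatenate(heads, axis=-1)   # head outputs joined per position
```

Each row of `attn` is the attention mask for one position over the whole sequence, which is the kind of pattern visualised in the examples later in the lecture.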
So I’m not going to go very much into the details of, you know, how Transformers work.
Although the attention mechanism is straightforward, the actual architecture itself is fairly complex.
And actually I recommend this blog post, ‘The Annotated Transformer’, for those of you who want to understand it in more detail.
But if we look at the kinds of operations that emerge from the system, it’s very intriguing.
So, as I’ve said, the idea is that a series of attention mechanisms are defined. Transformers, I should say, are particularly successful for natural language processing.
And I think the reason for that, and the reason that recurrent neural networks with attention were also first applied to language, is that in language this idea of being able to attend to things that are widely separated in the input is particularly important.
So this one word at the start of the paragraph might be very important when it comes to understanding something much later on in the paragraph.
And if you’re trying to extract, for example, the sentiment of a particular piece of text, there may be several words that are very spaced out that are required in order to make sense of it.
So it’s a natural fit for attention-based models.
So in this particular example here from the paper, the network has created keys and vectors for each element in the input sequence, and this creates a sequence of embeddings equal in length to the original sequence.
And then this process is repeated at the next level up.
So the network basically now defines another set of key vector pairs at every point along the sequence.
And those key vector pairs are then compared with the original embeddings to create these attention masks.
And so we see for this word ‘making’ here, I forget what the exact task was, but while this word ‘making’ was being processed, it was attending to a bunch of different words: ‘laws’, ‘2009’, the word ‘making’ itself, but also this phrase ‘more difficult’ here at the end of the sequence.
And so all of these things are tied up with the semantics of how this word ‘making’ is used in the sentence.
And generally what you find is you get these different, as I said, there are multiple attention vectors being defined for each point in the sequence, and what you get are different patterns of attention emerging.
So, for example, in this example here, and this is showing the kind of the all to all attention.
So how the embedding corresponding to each word in the sentence at one level is attending to all of the embeddings at another level.
And we see that this one is doing something quite complex.
It seems to be looking for phrases: the word ‘what’ is attending to ‘this is what’; the word ‘law’ is attending to ‘the law’ and ‘it’s’ and so forth.
So there’s some kind of, you know, complicated integrating of information going on.
Whereas here, another one of the sort of attention masks for the same network, we see that it’s doing something much simpler, which is it’s just attending to nearby words.
And so the overall effect of having access to all of these attention mechanisms is that the network can learn a really rich set of transforms for the data.
And what they realised with the Transformer network is that just by repeating this process many times they could get a very, very powerful model, in particular of language data.
And so in the original paper, they already showed that the Transformer was achieving state-of-the-art results for machine translation.
Since then it’s kind of gone from strength to strength: it now provides the state of the art for language modeling.
It’s also been used for other data types besides language: it’s been used for speech recognition, it’s been used for two dimensional data such as images.
But from this blog post here, this was posted by OpenAI in 2019, we can see just how powerful a Transformer-based language model can be.
So language modeling essentially just means iteratively predicting the next word in a piece of text, or the next subword symbol.
And so in this case, once the language model is trained, it can be given a human prompt and then you can generate from it just by asking it to predict what word it thinks will come next.
And then feeding that word in and repeating, with this whole Transformer-based network being able to attend to all of the previous context in the data.
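The generation loop described here can be sketched independently of any particular model: ask for a distribution over the next token given everything so far, sample from it, append the sample and repeat. The `toy_model` below is a hypothetical stand-in for a trained language model (it always predicts the previous token plus one, modulo a vocabulary of four) and exists only so that the loop can run.

```python
import numpy as np

def generate(next_token_probs, prompt, num_steps, rng):
    """Autoregressive sampling: predict, sample, feed back in, repeat."""
    tokens = list(prompt)
    for _ in range(num_steps):
        probs = next_token_probs(tokens)   # the model sees the whole context so far
        tokens.append(int(rng.choice(len(probs), p=probs)))
    return tokens

def toy_model(tokens, vocab_size=4):
    """Hypothetical stand-in for a trained model: all mass on (last token + 1)."""
    probs = np.zeros(vocab_size)
    probs[(tokens[-1] + 1) % vocab_size] = 1.0
    return probs

rng = np.random.default_rng(0)
generated = generate(toy_model, prompt=[0], num_steps=5, rng=rng)
```

A real language model would return a distribution spread over many plausible next words, so repeated runs with different random seeds would give different continuations.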
And what’s really interesting about this text relative to texts that have been generated by language models in the past is that it manages to stay on topic, it’s that it manages to keep the context intact throughout a relatively long piece of text.
So having started off talking about a herd of unicorns living in an unexplored valley in the Andes, it continues to talk about unicorns, it continues to, it keeps the setting constant in the Andes Mountains, it, you know, it invents a biologist from the University of La Paz.
And once it’s made these inventions, for example, once it’s named the biologist, it keeps that name intact.
So having called it Pérez once, it knows to keep on calling them Pérez throughout.
And the reason it can do that is that it has this really powerful use of context that comes from having this ability to attend to everything in the sequence so far.
So what attention is really doing here is allowing it to span very long divides, very long separations in the data.
And this is something that even, you know, before attention was introduced, even the most powerful recurrent neural networks, such as LSTM, they struggled to do because they had to store everything in the internal state of the network which was constantly being overwritten by new data.
So between the first time Pérez is introduced and the last time there might’ve been, you know, several hundred updates of the network and this information about Pérez would attenuate during these updates.
But attention allows you to kind of remove this attenuation, allows you to bridge these very long gaps and that’s really the secret to its power, particularly for language modeling.
And so I should say there’ve been many, many extensions to Transformer networks and they have gone from strength to strength, particularly with language modeling.
One extension that I’d like to look at in this lecture, which I find very interesting, is known as Universal Transformers.
So the idea here is that basically the weights of the Transformer are tied across each transform.
So if we look at this model here, we have the input sequence over time along the x-axis, and the effect of the Transformer is to generate a set of self-attention masks at each point along the sequence.
And then all of the embeddings associated with these points are then transformed again the next level up and so on.
Now in an ordinary Transformer the parameters of the self-attention operations are different at each transform, at each level going up on the y-axis.
And this means that the functional form of the transform that’s being enacted is different at each step.
So tying these weights going up through the stack makes it act like a recurrent neural network in depth.
So what you have is something like a recursive transform.
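The difference between an ordinary stack and a weight-tied stack can be sketched with a toy attention layer. The layer here is heavily simplified (a single projection, a residual connection, no feed-forward block or normalisation), so it is an assumption-laden illustration of weight tying rather than the Universal Transformer itself; all names are hypothetical.

```python
import numpy as np

def attention_layer(x, w):
    """One toy self-attention transform with a residual connection."""
    q = x @ w
    scores = q @ x.T / np.sqrt(x.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    a = np.exp(scores)
    a /= a.sum(axis=-1, keepdims=True)
    return x + a @ x

def ordinary_stack(x, per_layer_weights):
    for w in per_layer_weights:    # different parameters at every level
        x = attention_layer(x, w)
    return x

def tied_stack(x, shared_w, depth):
    for _ in range(depth):         # the same parameters reused at every level
        x = attention_layer(x, shared_w)
    return x

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))
w = 0.1 * rng.normal(size=(4, 4))

three_levels = tied_stack(x, w, depth=3)
six_levels = tied_stack(x, w, depth=6)   # depth can change without new parameters
```

Because the tied version has a single set of weights, the number of times the transform is applied becomes a free choice, in the same way that a recurrent network can be unrolled for any number of steps.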
And what’s interesting about that is that you then start to have something that can behave a little bit more algorithmically, something that is not only very good with language, but is good at learning the sorts of functions, the sorts of algorithmic behaviours that I talked about in the Neural Turing Machine section.
And so, this could be seen from some of the results for Universal Transformers where it was applied, for example, to the bAbI tasks, which are a set of kind of toy linguistic tasks generated using a grammar.
And the other thing I should say about this idea of having a recursive transform is that, because the weights are tied, you can enact this transform a variable number of times, in just the same way that you can run an RNN through a variable-length sequence.
So now we have something where the amount of time it spends transforming each part of the data can become variable, it can become data dependent.
And so this relates to work that I did in 2016 which I called Adaptive Computation Time.
The idea of Adaptive Computation Time was to change this: with an ordinary recurrent neural network there’s a one-to-one correspondence between input steps and output steps.
Every time an input comes in, the network emits an output.
And the problem with this in some sense is that it ties Computation Time to what we could call Data Time.
There’s one tick of computation for every step in the data.
And now you can alleviate this by stacking lots of layers on top of each other so now you have multiple ticks of computation for each point in the input sequence.
But the idea of Adaptive Computation Time was that maybe we could allow the network to learn how long it needed to think in order to make each decision.
And we called this the amount of time it spent pondering each decision. So the idea is that some input x₁ comes in at the first timestep, the network receives its hidden state from the previous timestep as usual for a recurrent neural network, and it then thinks for a variable number of steps before making a decision.
And this variable number of steps is determined by a Halting Probability.
So you can see these numbers 0.1, 0.3, 0.8; the idea is that when the sum of these probabilities passes a threshold of one, the network is ready to emit an output and move on to the next timestep.
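The halting rule just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `ponder` and the `max_steps` cap are hypothetical, the threshold of one is taken from the lecture, and treating the final step's weight as the remainder (so that the weights sum to one and the ponder count can be non-integer) is an assumption about how the overrun is handled.

```python
def ponder(halting_probs, threshold=1.0, max_steps=100):
    """Return (steps_taken, per_step_weights, ponder_cost).

    Accumulates halting probabilities until their sum reaches `threshold`.
    The last step gets the remainder 1 - (sum of earlier probabilities),
    so the per-step weights sum to one. Both the remainder rule and the
    `max_steps` cap are assumptions made for this sketch.
    """
    total = 0.0
    weights = []
    for n, h in enumerate(halting_probs, start=1):
        if total + h >= threshold or n == max_steps:
            remainder = 1.0 - total
            weights.append(remainder)
            # A non-integer "ponder time": whole steps plus the remainder.
            return n, weights, n + remainder
        weights.append(h)
        total += h
    raise ValueError("halting probabilities never reached the threshold")

steps, weights, cost = ponder([0.1, 0.3, 0.8])
```

With the numbers from the slide, the running sum is 0.1, then 0.4, then passes one on the third step, so the network halts after three steps with a remainder of about 0.6.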
And so what’s sort of the relevance of this mechanism to the rest of this lecture, which has been about more explicit attention mechanisms, is that, in some sense, the amount of time a person or even a neural network spends thinking about a particular decision that it makes is strongly related to the degree to which it attends to it.
So, I mean, there have been cognitive experiments done with people where, by measuring the amount of time it takes them to answer a particular question, you can sort of measure the amount of attention that they need to give to that question.
And so if we look at this concretely, at what happens when we apply Adaptive Computation Time to a recurrent neural network: this is an ordinary LSTM network being applied to language modeling, in this case next-step character prediction, and the y-axis of this graph shows the number of steps that the network stopped to think for.
Now, this number of steps is actually not an integer, because it can slightly overrun a complete step, but that’s not really important.
What’s important here is that there’s a variable amount of computation going on for each of these predictions that it has to make and you can immediately see a pattern.
So for example, the amount of ponder time goes up when there’s a space between words.
And the reason for this is that it’s at the start of words that we need to spend the longest thinking, because once you’ve gone most of the way through a word, it’s easy to predict the ending.
Once you’ve seen ‘p’, ‘e’, ‘o’, ‘p’, ‘l’, it’s pretty easy to predict that ‘e’ is going to come next.
Once you see a space after that ‘e’ then it becomes harder.
Now you have to think, well, ‘and the many people’, what word could come next? So this takes a little bit more thought, and then it tends to drop down again. And it spikes up even further when it comes to a kind of larger divider, like a full stop or a comma.
So if we think back to the plots I showed you at the very start of this lecture, which were to do with implicit attention, where we saw that a deep network or recurrent neural network will concentrate in some sense on, or respond more strongly to, certain parts of the sequence, we kind of see that same pattern emerge again when we give it a variable amount of time to think about what’s going on in the sequence.
And there’s some interesting consequences here.
So for example, one is that, because this is a question of how long the network needs to take in order to make a particular prediction, the network is only interested in predictable data.
So, for example, if you see these ID tags, this is from Wikipedia data which contains, you know, XML tags as well, we can see that there’s no spike in thinking time when it comes to these ID numbers.
And this is kind of interesting because these ID numbers are hard to predict.
So it isn’t simply that the network thinks longer whenever it finds something that’s harder to predict, it thinks longer when it sort of believes that there’s a benefit to thinking longer, when thinking longer is likely to make it better able to make a prediction.
And the reason it would be better able to make a prediction is that it allows it to spend more time processing the context on which that prediction is based.
So this kind of goes back again to the idea we talked about in Transformers of having these repeated steps of contextual processing as being the thing that builds up the information the network needs to make a prediction.
And so there’s a nice combination of this idea of Adaptive Computation Time with these Universal Transformer models and so in this case here, we have a task from the bAbI dataset where there’s a series of sentences, so these sentences presented along the x-axis are kind of like the input for the network then, or these are the context that the network needs to know about, and then it gets asked the question.
The question here was: ‘where was the apple before the bathroom?’ And if you go through all of these sentences, and I think I’ve cropped this graph so it doesn’t have all of them, but you can see that things are happening with the apple: ‘John dropped the apple’, ‘John grabbed the apple’; ‘John went to the office’, so we think the apple at this point is probably in the office.
‘John journeyed to the bathroom’, well, maybe now it’s gone to the bathroom.
But in between those two things were some pieces of information that weren’t relevant: ‘Sandra took the milk’, for example.
‘John traveled to the office’, we’re back in the office again.
So there’s a little puzzle here that the network has to work out as to where the apple has ended up.
And of course some parts of the sequence are important for that puzzle and some parts aren’t.
‘John discarded the apple there’, well of course that’s very important.
Basically all of the ones that mentioned where John is are important and generally those are the ones that the network spends longer thinking about.
So via this Adaptive Computation Time, and via this Transformer model where, you know, at every step in time each point along this sequence is attending to all of the others, we build up a similar picture in some sense to the one we had at the start of the lecture, where we can see that the network has learned to focus more on some parts of the sequence than others.
And so once again, this is what attention is all about, it’s about ignoring things and being selective.
And so to conclude, I think the main point I would like to get across in this lecture is that selective attention appears to be as useful for deep learning as it is for people.
As we saw at the start of the lecture, implicit attention is always present to some degree in neural networks just because they’ve learned to become more sensitive to certain parts of the data than to others.
But we can also add explicit attention mechanisms on top of that and it seems to be very beneficial to do so.
These mechanisms can be stochastic, so-called hard attention, that we can train with reinforcement learning or they can be differentiable, so-called soft attention, which can be trained with ordinary backprop and end-to-end learning.
And we can use attention to attend to memory or to some internal state of the network as well as to data.
So many types of attention mechanism have been defined and I should say that even the ones I’ve covered in this lecture only cover a small fraction of what’s being considered in the field and many more could be defined.
And I think what’s become very clear over the last few years is that you can really get excellent results, state-of-the-art results in sequence learning by just using attention, by using Transformers that essentially get rid of all of the other mechanisms that deep networks have for attending to long-range contexts.
And that is the end of this lecture on attention and memory in deep learning.
Thank you very much for your attention.
