[paper] [TensorFlow] [video walkthrough]
[GIF from the WeChat account CVer: "Google reworks the Transformer! A layer with 8 tokens beats one with 1024! NeurIPS 2021"]
Contents
Abstract
1 Introduction
2 TokenLearner Modules for Adaptive Tokenization
2.1 TokenLearner
2.2 TokenFuser
3 Experiments with Images
3.1 Network architecture implementation
3.3 Ablation: where should we have TokenLearner?
3.4 Results
3.5 Ablations and Comparisons
4 TokenLearner for Videos
5 Experiments with Videos: TokenLearner with Video Vision Transformer
5.1 Network architecture implementation
5.3 Results
6 Experiments with Videos: TokenLearner with Bottleneck Transformer
6.1 Network architecture implementation
6.2 Results
In this paper, we introduce a novel visual representation learning which relies on a handful of adaptively learned tokens, and which is applicable to both image and video understanding tasks.
Instead of relying on hand-designed splitting strategies to obtain visual tokens and processing a large number of densely sampled patches for attention, our approach learns to mine important tokens in visual data. This results in efficiently and effectively finding a few important visual tokens and enables modeling of pairwise attention between such tokens, over a longer temporal horizon for videos, or the spatial content in images.
Our experiments demonstrate strong performance on several challenging benchmarks for both image and video recognition tasks. Importantly, due to our tokens being adaptive, we accomplish competitive results at significantly reduced compute amount. We obtain comparable results to the state-of-the-arts on ImageNet while being computationally more efficient. We establish new state-of-the-arts on multiple video datasets, including Kinetics-400, Kinetics-600, Charades, and AViD.
The paper in one sentence:
This paper introduces a novel visual representation learning approach that relies on a handful of adaptively learned tokens and is applicable to both image and video understanding tasks.
Characteristics and advantages of the approach:
Rather than relying on hand-designed splitting strategies to obtain visual tokens and processing a large number of densely sampled patches for attention, the method learns to mine the important tokens in the visual data. This efficiently and effectively finds a few important visual tokens and enables modeling pairwise attention between them, over a longer temporal horizon for videos or over the spatial content of images.
Experimental conclusions:
Experiments demonstrate strong performance on several challenging benchmarks for both image and video recognition. Importantly, because the tokens are adaptive, the method achieves competitive results at a significantly reduced amount of compute: results comparable to the state of the art on ImageNet while being computationally more efficient, and new state-of-the-art results on multiple video datasets, including Kinetics-400, Kinetics-600, Charades, and AViD.
Images and videos provide an abundance of visual information. Image understanding is a long standing problem in computer vision, and despite incredible advances, obtaining the best visual representation for a variety of image understanding tasks is still an active area of research. Videos, in addition to addressing a similar image understanding task, require employing effective spatial-temporal processing of both RGB and time streams to capture long-range interactions [6, 39, 23, 19, 26, 14, 36, 22, 27, 1]. An important aspect of this understanding is how to quickly learn which parts of the input video stream are important, both spatially and temporally, and to focus computational resources on them. But what basic processing mechanisms are able to do so successfully for both images and videos?
The central question:
Images and videos provide an abundance of visual information. Image understanding is a long-standing problem in computer vision, and despite incredible advances, obtaining the best visual representation for a variety of image understanding tasks remains an active area of research. Videos, beyond posing a similar image understanding task, additionally require effective spatio-temporal processing of both the RGB and time streams to capture long-range interactions. An important aspect of this understanding is how to quickly learn which parts of the input video stream are important, both spatially and temporally, and to focus computational resources on them. But what basic processing mechanism can do so successfully for both images and videos?
Recent advancements in image understanding demonstrate improved accuracy on vision classification tasks. For example, departing from standard convolutional approaches, the Vision Transformer (VIT) [11] treats the image as a sequence of patches, utilizing the Transformer architecture [41] similar to text understanding.
Standard approaches for video recognition take videos as stacked images (i.e., a space-time volume) and tend to extend 2D neural architectures to 3D (e.g., 3D-ResNets [19, 6, 40, 13]). In parallel to the visual Transformer for images, some approaches [2, 3] proposed to create 3D ‘cubelet’ video tokens on a regular 3D-grid which are further processed by a Transformer, resulting in computationally heavy models. There are too many tokens to process, especially for longer videos.
ViT and its limitations: this paper starts from ViT to address the question above and explores further improvements on top of it.
For images, recent advances show improved accuracy on vision classification tasks. For example, departing from standard convolutional approaches, the Vision Transformer (ViT) treats the image as a sequence of patches, using a Transformer architecture similar to the ones used for text understanding.
For videos, standard recognition approaches treat videos as stacked images (i.e., a space-time volume) and tend to extend 2D architectures to 3D (e.g., 3D-ResNets). In parallel to the vision Transformer for images, some approaches create 3D 'cubelet' video tokens on a regular 3D grid, which are then processed by a Transformer, resulting in computationally heavy models. There are too many tokens to process, especially for longer videos.
The main question addressed in this work is how to adaptively learn the representation from visual inputs to most effectively capture the spatial information for images and spatio-temporal interactions for videos. Here are our main ideas:
The first key observation is we are able to learn to represent visual data by learning to ‘tokenize’ the input. This is in contrast to previous approaches which densely sampled tokens e.g., 16x16 or 32x32 for either images or videos [11, 3]. Specifically, we can learn to compute important regions in the input image/video, making the tokens adapt to the input data. We compute multiple spatial weight maps per frame with a spatial attention mechanism, and use it for the tokenization. The goal of these maps is to learn which areas are of importance. Here, each spatial weight map is multiplied with the input to form a ‘token’, to be processed by the subsequent learning modules.
Furthermore, we find that very few tokens may be sufficient for a visual understanding task. More specifically, for images we show that one can significantly reduce the computational budget of the Vision Transformer, when inserting 8-16 tokens as an intermediate representation (instead of keeping 200∼500). Our TokenLearner is able to reduce the number of total FLOPS by half, while maintaining or even increasing the classification accuracy. Similarly, for video recognition we show improved performance over the state-of-the-art on three challenging datasets while using only 8-16 intermediate tokens per frame.
The main question addressed in this work is how to adaptively learn the representation from visual inputs so as to most effectively capture the spatial information of images and the spatio-temporal interactions of videos. The main ideas are:
The first key observation is that the model can learn to represent visual data by learning to 'tokenize' the input. This is in contrast to previous approaches that densely sample tokens, e.g., 16x16 or 32x32, for either images or videos. Specifically, the model learns to compute the important regions in the input image/video, making the tokens adapt to the input data. Multiple spatial weight maps are computed per frame with a spatial attention mechanism and used for the tokenization; the goal of these maps is to learn which areas are important. Each spatial weight map is multiplied with the input to form a 'token', which is then processed by the subsequent learning modules.
Furthermore, the paper finds that very few tokens may be sufficient for a visual understanding task. More specifically, for images it shows that the computational budget of the Vision Transformer can be significantly reduced when 8-16 tokens are used as an intermediate representation (instead of keeping 200-500). TokenLearner is able to cut the total FLOPS in half while maintaining or even improving classification accuracy. Similarly, for video recognition it shows improved performance over the state of the art on three challenging datasets while using only 8-16 intermediate tokens per frame.
Performance summary:
The approach is simple, efficient, and, as shown by the results, outperforms methods including both convolutional methods and previous space-time Transformer ones from prior art. We demonstrate that our method performs comparably to previous Transformer models on ImageNet while meaningfully reducing the computation cost. In video understanding tasks, we establish new state-of-the-art numbers on Kinetics400, Kinetics600, Charades, and AViD datasets by outperforming prior models.
The results show that the approach is simple, efficient, and outperforms prior methods, including both convolutional approaches and earlier space-time Transformers. On ImageNet the method performs comparably to previous Transformer models while meaningfully reducing the computation cost. On video understanding tasks it surpasses prior models and sets new state-of-the-art results on Kinetics-400, Kinetics-600, Charades, and AViD.
In vision transformer architectures such as ViT [11], an input image is first tokenized by splitting it into small (e.g., 16x16) spatial patches, which are used as input to the model. Similarly, in recent video transformer architectures, such as ViViT [2] and TimeSformer [3], the video is tokenized by cutting the video into 2d spatial or 3d spatio-temporal cubes on a regular grid.
Instead of processing fixed, tokenized inputs, our attention module learns the tokens that are to be used for the recognition task. We gain several important properties by doing so:
(1) We enable the adaptive tokenization so that the tokens can be dynamically selected conditioned on the input.
(2) This also effectively reduces the total number of tokens for the transformer, which is particularly beneficial considering that there are many tokens in videos (e.g., 4096) and the computation is quadratic to the number of tokens.
(3) Finally, we provide an ability for each subsequent layer to learn to rely on different space-time tokenizations, potentially allowing different layers to capture different aspects of the video. These dynamically and adaptively generated tokens can be used in standard transformer architectures such as ViT for images and ViViT for videos, or can be used within the specialized video architecture which we discuss further in Section 4.
Overview of the method: how it differs from prior work, and three key properties
In vision transformer architectures such as ViT, the input image is first tokenized by splitting it into small (e.g., 16x16) spatial patches, which are used as input to the model. Similarly, in recent video transformer architectures such as ViViT and TimeSformer, the video is tokenized by cutting it into 2D spatial or 3D spatio-temporal cubes on a regular grid.
Instead of processing fixed, pre-tokenized inputs, the attention module in this paper learns the tokens that are to be used for the recognition task. Doing so yields several important properties:
(1) It enables adaptive tokenization, so that the tokens can be dynamically selected conditioned on the input.
(2) It also effectively reduces the total number of tokens for the transformer, which is particularly beneficial given that videos contain many tokens (e.g., 4096) and the computation is quadratic in the number of tokens.
(3) Finally, each subsequent layer gains the ability to rely on a different space-time tokenization, potentially allowing different layers to capture different aspects of the video. These dynamically and adaptively generated tokens can be used in standard transformer architectures such as ViT for images and ViViT for videos, or within the specialized video architecture discussed further in Section 4.
[This passage covers (1) the motivation behind the method and (2) an analysis of what it enables; point (2) often reflects the authors' reasoning ability and depth of background knowledge.]
Let X be an input tensor with a space-time shape: X ∈ R^{T×H×W×C} where H × W corresponds to the spatial dimension of the input, T is the temporal dimension (i.e., number of frames), and C is the number of channels. Let X_t be a temporal slice of it, corresponding to the frame t: X_t ∈ R^{H×W×C}. In the case of an image input, T = 1 and X = X_t. Note that X could also be an intermediate representation within a network, and X_t will be its slice in such case.
Notation:
Let X be an input tensor with a space-time shape: X ∈ R^{T×H×W×C}, where H × W is the spatial dimension of the input, T is the temporal dimension (i.e., the number of frames), and C is the number of channels. Let X_t be a temporal slice of it, corresponding to frame t: X_t ∈ R^{H×W×C}. For an image input, T = 1 and X = X_t. Note that X could also be an intermediate representation within the network, in which case X_t is its slice.
For every time frame t, we learn to generate a series of S tokens, Z_t = [z_i]^S_{i=1}, from the input frame X_t. Specifically, we formulate a tokenizer function, z_i = A_i(X_t), which maps the input frame X_t to a token vector z_i : R^{H×W×C} |→ R^C. The idea is to learn our tokenizer function A_i to adaptively select an informative combination of pixels (or spatial locations) in X_t, and we have S number of such functions. This way, our tokens will not be fixed splits of the input tensor, but a set of adaptively changing spatial selections. Different tokens will be mined per frame, allowing us to model their space-time relations/interactions in case of videos. We also set S to be smaller than H × W (e.g., S = 8 and H × W = 32 × 32), enabling the model to significantly reduce the computations needed for the layers following this module.
For every time frame t, the model learns to generate a series of S tokens, Z_t = [z_i]^S_{i=1}, from the input frame X_t.
Specifically, a tokenizer function z_i = A_i(X_t) is formulated, mapping the input frame X_t to a token vector z_i: R^{H×W×C} → R^C.
The idea is to learn the tokenizer functions A_i to adaptively select an informative combination of pixels (or spatial locations) in X_t; there are S such functions.
In this way, the tokens are not fixed splits of the input tensor, but a set of adaptively changing spatial selections.
Different tokens are mined per frame, allowing the model to capture their space-time relations/interactions in the case of videos.
S is also set smaller than H × W (e.g., S = 8 and H × W = 32 × 32), enabling the model to significantly reduce the computation needed by the layers that follow this module.
Here, our tokenizer z_i = A_i(X_t) is implemented with a spatial attention mechanism: i.e., the model learns to compute a weight map (of size H × W) conditioned on the input X_t, which is multiplied with X_t itself. More specifically, let α_i(X_t) be a function generating the spatial H × W × 1 weight map. Each token z_i is generated as

z_i = ρ(X_t ⊙ A_{iw}) = ρ(X_t ⊙ γ(α_i(X_t))),    (1)

where ⊙ is the Hadamard product (i.e., element-wise multiplication) and A_{iw} ∈ R^{H×W×C} is an intermediate weight tensor computed with the function α_i(X_t) and the broadcasting function γ(·). Finally, spatial global average pooling ρ(·) is applied on top of them to reduce the dimensionality to R^C. The resulting tokens are gathered to form the output tensor: Z_t = [z_i]^S_{i=1} ∈ R^{S×C}.
The overall process has a form of an element-wise spatial self-attention. In our version, {α_i(·)}^S_i=1 are implemented together as a single or a series of convolutional layers (with the channel size S) followed by a sigmoid function, although this could be extended with other implementations. In case of an image, Z = Z_t. In the case of a video, the tokens Z_t from all the frames are collected to form the final output token tensor Z ∈ R^{ST×C} .
We specifically name our token learning module “TokenLearner”. Figure 1 visually summarizes the TokenLearner module.
Tokenizer → spatial attention:
The tokenizer z_i = A_i(X_t) is implemented with a spatial attention mechanism: the model learns to compute a weight map (of size H × W) conditioned on the input X_t, which is then multiplied with X_t itself. More specifically, α_i(X_t) is the function generating the spatial H × W × 1 weight map, and each token z_i is computed with Eq. (1).
Here ⊙ is the Hadamard product (element-wise multiplication), and A_{iw} ∈ R^{H×W×C} is an intermediate weight tensor computed from the function α_i(X_t) and the broadcasting function γ(·). Finally, spatial global average pooling ρ(·) is applied on top to reduce the dimensionality to R^C. The resulting tokens are gathered into the output tensor Z_t = [z_i]^S_{i=1} ∈ R^{S×C}.
The whole process takes the form of element-wise spatial self-attention. In this paper, {α_i(·)}^S_{i=1} are implemented together as a single or a series of convolutional layers (with channel size S) followed by a sigmoid function, although other implementations are possible. For an image, Z = Z_t. For a video, the tokens Z_t from all frames are collected to form the final output token tensor Z ∈ R^{ST×C}.
This token learning module is named the TokenLearner. Figure 1 illustrates the TokenLearner module (see also the GIF at the top of this post).
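To make the tokenizer concrete, here is a minimal TensorFlow sketch of a TokenLearner-style layer as described above: S spatial attention maps α produced by a few 3x3 convolutions (GELU in between, sigmoid at the end), a Hadamard product with the input, and spatial global average pooling ρ. It follows the shapes in the text but is an illustrative reimplementation, not the official code; the class name and hyperparameters are assumptions.

```python
import tensorflow as tf

class TokenLearner(tf.keras.layers.Layer):
    """Minimal sketch of the TokenLearner tokenizer described above.

    Maps a feature map of shape (B, H, W, C) to S learned tokens of shape
    (B, S, C) using S spatial attention maps (alpha), a Hadamard product
    with the input, and spatial global average pooling (rho).
    """

    def __init__(self, num_tokens=8, **kwargs):
        super().__init__(**kwargs)
        self.num_tokens = num_tokens
        # alpha_i(.): a few 3x3 convs with GELU in between, ending in a sigmoid;
        # the channel size equals the number of tokens S.
        self.attention = tf.keras.Sequential([
            tf.keras.layers.Conv2D(num_tokens, 3, padding='same', activation='gelu'),
            tf.keras.layers.Conv2D(num_tokens, 3, padding='same', activation='gelu'),
            tf.keras.layers.Conv2D(num_tokens, 3, padding='same', activation='gelu'),
            tf.keras.layers.Conv2D(num_tokens, 3, padding='same', activation='sigmoid'),
        ])

    def call(self, x):
        # x: (B, H, W, C), a single frame X_t or an intermediate feature map
        maps = self.attention(x)                      # (B, H, W, S) weight maps
        maps = tf.transpose(maps, [0, 3, 1, 2])       # (B, S, H, W)
        maps = tf.expand_dims(maps, -1)               # (B, S, H, W, 1), broadcast over C
        feats = tf.expand_dims(x, 1)                  # (B, 1, H, W, C)
        weighted = maps * feats                       # Hadamard product, Eq. (1)
        return tf.reduce_mean(weighted, axis=[2, 3])  # rho(.): (B, S, C)
```

With S = 8 on a 14 × 14 (196-token) ViT feature map, every layer after this module only ever sees 8 tokens per frame.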
Compute reduction in Transformers:
The learned tokens (i.e., the outputs of the TokenLearner Z) are provided to the subsequent layers for the visual representation learning, such as multi-head self-attention (MHSA) used in Vision Transformer and ViViT. With the TokenLearner, these subsequent layers only need to process a small number of tokens (e.g., 8 instead of 1024) and this significantly reduces the computations, as they are quadratic to the number of tokens. Figure 3 (a) shows a basic architecture inserting the TokenLearner module within ViT. It could be added at any location within the network, and the relative compute of the Transformer layers after the TokenLearner becomes almost negligible due to the huge difference in the number of tokens.
How the method reduces computation in the Transformer:
The learned tokens (i.e., the TokenLearner outputs Z) are passed to the subsequent layers for visual representation learning, such as the multi-head self-attention (MHSA) used in the Vision Transformer and ViViT. With TokenLearner, these subsequent layers only need to process a small number of tokens (e.g., 8 instead of 1024), which significantly reduces the computation, since it is quadratic in the number of tokens. Figure 3(a) shows a basic architecture that inserts the TokenLearner module into ViT. It can be added at any location in the network, and due to the huge difference in token counts, the relative compute of the Transformer layers after TokenLearner becomes almost negligible.
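As a rough illustration of that quadratic dependence (token counts taken from the text; counting token pairs is only a proxy for attention FLOPS, not an exact cost model):

```python
# Self-attention compares every token with every other token, so its cost
# grows with the square of the token count.
def attention_pairs(num_tokens: int) -> int:
    return num_tokens ** 2

for n in (1024, 196, 16, 8):
    print(f"{n:5d} tokens -> {attention_pairs(n):9,d} token pairs per layer")

# 196 tokens -> 38,416 pairs; 8 tokens -> 64 pairs, i.e. roughly a 600x
# reduction for the attention layers that follow the TokenLearner module.
```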
After the TokenLearner generates tokens and its subsequent Transformer layer (e.g., MHSA) processes them, the “TokenFuser” could be used to further (1) fuse information across the tokens and (2) remap the representation back to its original spatial resolution. This enables the model to capture spatial (or spatio-temporal) ‘patterns’ formulated by the tokens, and recover the original input tensor shape when necessary.
After TokenLearner generates the tokens and the subsequent Transformer layer (e.g., MHSA) processes them, the “TokenFuser” can be used to further (1) fuse information across the tokens and (2) remap the representation back to its original spatial resolution. This lets the model capture spatial (or spatio-temporal) ‘patterns’ formulated by the tokens, and recover the original input tensor shape when necessary.
First, given the token tensor Y ∈ R^{ST×C} from a Transformer layer, we apply a linear layer (i.e., a fully connected MLP layer) over the tokens, not channels. That is, we learn a linear function of R^{ST} |→ R^{ST} where S is the number of our tokens mined per frame and T is temporal size of the input tensor, and apply it to every channel independently. That is, we update Y = (Y^T M)^T where M is a learnable weight matrix with size ST × ST. The result of such operation maintains the tensor size of ST × C. We believe this also has a connection to the observations from the concurrent work, MLPMixer [38], that it is beneficial to have token-wise linear layers capturing patterns formed by tokens.
Next, the TokenFuser processes each temporal slice Y_t ∈ R^{S×C} individually, and remaps the token tensor of size S × C back to H × W × C, by learning to combine the tokens for each spatial location in H × W differently.
X^{j+1}_t = B(Y_t, X^j_t) = B_w Y_t + X^j_t,    (2)

where X^j_t is the residual input to the previous TokenLearner module, Y_t is the processed tokens in the TokenFuser module, and X^{j+1}_t is the output. B_w ∈ R^{HW×S} is an intermediate weight tensor computed with the function β_i(X_t). The function β_i(X_t) is implemented with a simple linear layer followed by a sigmoid function. Figure 2 illustrates the overall process of the TokenFuser (the token-wise linear layer is omitted).
First, given a token tensor Y ∈ R^{ST×C} from a Transformer layer, a linear layer (i.e., a fully connected MLP layer) is applied over the tokens, not the channels. That is, a linear function R^{ST} → R^{ST} is learned, where S is the number of tokens mined per frame and T is the temporal size of the input tensor, and it is applied to every channel independently: Y is updated as Y = (Y^T M)^T, where M is a learnable weight matrix of size ST × ST. This operation keeps the tensor size at ST × C. The authors note a connection to the concurrent MLPMixer work [38]: token-wise linear layers are beneficial for capturing patterns formed by the tokens.
Next, the TokenFuser processes each temporal slice Y_t ∈ R^{S×C} individually and remaps the token tensor of size S × C back to H × W × C by learning to combine the tokens differently for each spatial location in H × W, as in Eq. (2). Here X^j_t is the residual input to the previous TokenLearner module, Y_t is the tokens processed by the TokenFuser module, and X^{j+1}_t is the output. B_w ∈ R^{HW×S} is an intermediate weight tensor computed with the function β_i(X_t), which is implemented as a simple linear layer followed by a sigmoid. Figure 2 illustrates the overall TokenFuser process (the token-wise linear layer is omitted).
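A minimal TensorFlow sketch of the TokenFuser for a single temporal slice, combining the token-wise linear layer and the β(·) remapping described above; the class name and shapes are illustrative assumptions, not the authors' code.

```python
import tensorflow as tf

class TokenFuser(tf.keras.layers.Layer):
    """Minimal sketch of the TokenFuser described above (per temporal slice).

    Mixes information across the S tokens and remaps them back to the
    original H x W resolution, with a residual connection to the input of
    the preceding TokenLearner.
    """

    def __init__(self, num_tokens=8, **kwargs):
        super().__init__(**kwargs)
        # Token-wise linear layer: a learnable S x S mixing matrix M
        # (ST x ST in the multi-frame case), applied to every channel.
        self.token_mixer = tf.keras.layers.Dense(num_tokens, use_bias=False)
        # beta(.): per-pixel weights over the S tokens, followed by a sigmoid.
        self.beta = tf.keras.layers.Dense(num_tokens, activation='sigmoid')

    def call(self, tokens, x_residual):
        # tokens:     (B, S, C)    -- Transformer output Y_t
        # x_residual: (B, H, W, C) -- input to the previous TokenLearner, X_t^j
        y = tf.transpose(tokens, [0, 2, 1])         # (B, C, S)
        y = self.token_mixer(y)                     # mix across tokens, per channel
        y = tf.transpose(y, [0, 2, 1])              # (B, S, C)

        bw = self.beta(x_residual)                  # (B, H, W, S) weights per location
        fused = tf.einsum('bhws,bsc->bhwc', bw, y)  # B_w Y_t, Eq. (2)
        return fused + x_residual                   # residual connection, X_t^{j+1}
```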
In order to validate the power of the TokenLearner module, we first try TokenLearner on image representation learning. We evaluate two different architectures: (a) simply inserting the TokenLearner within standard transformer models, and (b) using the TokenFuser in addition to the TokenLearner at multiple locations within the transformers.
To validate the power of the TokenLearner module, it is first tried on image representation learning. Two different architectures are evaluated: (a) simply inserting TokenLearner into standard transformer models, and (b) using the TokenFuser in addition to the TokenLearner at multiple locations within the transformers.
We use the Vision Transformer architecture [11], following its detailed settings and implementation [8]. We use ViT-B/16 and ViT-L/16 as our backbone, while also applying the TokenLearner to ViT-B/32 (the same model but with an initial patch size of 32x32 in the beginning), ViT-S/32 (smaller version with 384 channels), and more. The ViT-S and ViT-B backbone models have 12 transformer layers, while ViT-L has 24. Following the exact setting of [11], we used the input resolution of 224x224, 384x384, or 512x512 depending on the dataset and the model (i.e., 196, 576, or 1024 tokens). Positional encodings identical to ViT are used.
The Vision Transformer architecture [11] is used, following its detailed settings and implementation [8].
ViT-B/16 and ViT-L/16 serve as the backbones, and TokenLearner is also applied to ViT-B/32 (the same model but with an initial patch size of 32x32), ViT-S/32 (a smaller version with 384 channels), and others. The ViT-S and ViT-B backbones have 12 transformer layers, while ViT-L has 24. Following the exact setting of [11], the input resolution is 224x224, 384x384, or 512x512 depending on the dataset and model (i.e., 196, 576, or 1024 tokens). Positional encodings identical to ViT are used.
Figure 3 (a) and (b) show two different architectures incorporating TokenLearner. (a) is formed by inserting TokenLearner in the middle of the network, such as after the 6th transformer among 12, while (b) uses both TokenLearner and TokenFuser. In particular, our model (b) is formed by replacing conventional Transformer layers with a series of TokenLearner-Transformer-TokenFuser. Similar to (a), such replacement is done only for the layers after a certain layer. For instance, we keep six of the standard Transformer MHSA layers in the beginning, and replace the remaining six layers with our TokenLearner-Transformer-TokenFuser modules repeated six times. We also modified some of our models to have more transformer layers (e.g., 21 instead of 12), and we specify this when we do so. Note that the computation increase caused by the transformer layers added after the TokenLearner module is very small, as the number of tokens in these layers is few: 8 or 16.
Figure 3 (a) and (b) show two different architectures incorporating TokenLearner. (a) is formed by inserting TokenLearner in the middle of the network, e.g., after the 6th of 12 transformer layers, while (b) uses both TokenLearner and TokenFuser. In particular, model (b) is formed by replacing conventional Transformer layers with a series of TokenLearner-Transformer-TokenFuser blocks. As in (a), this replacement is only done for the layers after a certain depth: for instance, the first six standard Transformer MHSA layers are kept, and the remaining six layers are replaced with the TokenLearner-Transformer-TokenFuser module repeated six times. Some models are also modified to have more transformer layers (e.g., 21 instead of 12), which is specified when done. Note that the extra computation from the transformer layers added after the TokenLearner module is very small, since these layers only operate on 8 or 16 tokens.
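The wiring of architecture (a) can be sketched as follows, reusing the TokenLearner sketch above. Here patch_embed and transformer_blocks stand in for standard ViT components and are hypothetical names; class-token handling and the classification head are omitted for brevity.

```python
import tensorflow as tf

def vit_with_tokenlearner(images, patch_embed, transformer_blocks,
                          insert_at=6, num_tokens=8):
    """Sketch of Figure 3 (a): TokenLearner inserted after the 6th of 12 blocks."""
    x = patch_embed(images)                   # (B, 196, C) for 224x224 input, 16x16 patches
    token_learner = TokenLearner(num_tokens)  # sketch defined earlier in this post
    for i, block in enumerate(transformer_blocks):
        if i == insert_at:
            # Rearrange the 196 tokens back onto their 14x14 grid so the spatial
            # attention maps can be computed, then reduce to num_tokens tokens.
            b = tf.shape(x)[0]
            side = int(x.shape[1] ** 0.5)
            x = tf.reshape(x, [b, side, side, x.shape[-1]])
            x = token_learner(x)              # (B, 8, C)
        x = block(x)                          # the remaining blocks are nearly free
    return x
```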
We tried various number of tokens including S = 8, 16, 32, and use S = 8 and 16 as our default settings. That is, the TokenLearner is learning to abstract an image into 8 (or 16) tokens. The spatial attention function (α) in TokenLearner is implemented with four 3x3 conv. layers (with gelu in between), whose channel size is identical to the number of tokens (e.g., S = 8).
We adopt the training settings (e.g., learning rate, training epochs, etc.) of [11].
Various numbers of tokens were tried, including S = 8, 16, and 32, with S = 8 and 16 as the default settings; that is, TokenLearner learns to abstract an image into 8 (or 16) tokens. The spatial attention function (α) in TokenLearner is implemented with four 3x3 convolutional layers (with GELU in between), whose channel size equals the number of tokens (e.g., S = 8).
The training settings (e.g., learning rate, training epochs) of [11] are adopted.
We first conducted an ablation to decide the best location to place the TokenLearner within the model. Figure 4 shows the results. The experiment was done with our model (a), without TokenFuser. It is showing the few-shot classification accuracies on ImageNet with JFT pre-training, following the protocol of ViT [11]. In addition, we show how the computation amount (FLOPS) changes per TokenLearner location. Basically, due to the large difference between the number of tokens with and without the TokenLearner (e.g., 8 with TokenLearner vs. 196 without), the computation of the transformers after the TokenLearner module becomes almost negligible compared to the transformers before the TokenLearner location.
We found that inserting TokenLearner in the middle of the network (at 1/2) achieves almost identical accuracies, while cutting the computation by (almost) half. In addition, having the TokenLearner at the later layer (after 3/4 of the network) achieves even superior performance compared to not using the TokenLearner while performing faster, thanks to its adaptiveness.
An ablation was first conducted to decide the best location to place TokenLearner within the model; Figure 4 shows the results. The experiment used model (a), without TokenFuser, and reports few-shot classification accuracies on ImageNet with JFT pre-training, following the protocol of ViT [11]. The figure also shows how the computation (FLOPS) changes with the TokenLearner location. Because of the large difference in token counts with and without TokenLearner (e.g., 8 with it vs. 196 without), the computation of the transformer layers after the TokenLearner module becomes almost negligible compared to those before it.
Inserting TokenLearner in the middle of the network (at 1/2) achieves almost identical accuracy while cutting the computation by (almost) half. Moreover, placing TokenLearner at a later layer (after 3/4 of the network) achieves even better performance than not using TokenLearner at all, while still running faster, thanks to its adaptiveness.
TokenFuser
First, we compare the TokenLearner models with and without the TokenFuser module. More specifically, we compared the model (a) and the model (b) from Figure 4, to confirm the effectiveness of the TokenFuser. Table 5 shows the results.
TokenFuser. First, the TokenLearner models with and without the TokenFuser module are compared; more specifically, model (a) and model (b) from Figure 4 are compared to confirm the effectiveness of the TokenFuser. Table 5 shows the results.
TokenLearner vs. pooling
A straightforward alternative to the TokenLearner module is the use of spatial pooling to reduce the number of tokens. It can be done by spatially rearranging the tokens to have the height and width, and then applying conventional spatial pooling. This is similar to the pooling-based MHSA module used in [12].
Table 6 compares the TokenLearner against the spatial pooling. In all these experiments, ViT L/16 model was used. We are able to observe that there is a benefit in token ‘learning’. The pooling-based token reduction does have computation similar to the TokenLearner, but it loses its accuracy compared to the base model. On the other hand, TokenLearner performs a bit better than the base model despite the low computation.
TokenLearner vs. pooling. A straightforward alternative to the TokenLearner module is to use spatial pooling to reduce the number of tokens: the tokens are spatially rearranged into a height × width grid, and conventional spatial pooling is then applied, similar to the pooling-based MHSA module in [12]. Table 6 compares TokenLearner against spatial pooling; all of these experiments use the ViT-L/16 model. A benefit of token 'learning' can be observed: the pooling-based token reduction has computation similar to TokenLearner, but it loses accuracy compared to the base model, whereas TokenLearner performs slightly better than the base model despite the lower computation.
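For reference, the pooling baseline from this ablation can be sketched as below. The pool size and shapes are assumptions; the point is that the reduction is fixed rather than learned.

```python
import tensorflow as tf

def pooling_token_reduction(tokens, height, width, pool_size=4):
    """Pooling baseline from the ablation: rearrange tokens onto their spatial
    grid and apply conventional average pooling, instead of learning tokens."""
    c = tokens.shape[-1]
    x = tf.reshape(tokens, [-1, height, width, c])       # put tokens back on their grid
    x = tf.keras.layers.AveragePooling2D(pool_size)(x)   # fixed, non-adaptive reduction
    new_len = (height // pool_size) * (width // pool_size)
    return tf.reshape(x, [-1, new_len, c])               # fewer tokens, but not learned
```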
TokenFuser alternatives
Here, we experimentally compare the proposed TokenFuser module with its alternatives. The role of the TokenFuser is to mix the output tokens from the Transformer layer and map it back to the original shape before the token reduction. The most straightforward alternative would be to (1) use the masks from the TokenLearner module to ‘unpool’ the output tokens. The idea is to multiply each output token with the corresponding spatial map computed during the previous TokenLearner module, and sum all of them to recover the original input tensor shape. Alternatively, (2) we can use one more transformer layer to increase the number of tokens back to the original number of tokens, similar to the ‘re-projection’ used in [45].
Figure 7 shows the results with B/16. The unpooling strategy performed worse. The reprojection strategy performed comparably to the TokenFuser, but required more FLOPS.
TokenFuser alternatives. The proposed TokenFuser module is compared experimentally against alternatives. The role of the TokenFuser is to mix the output tokens of the Transformer layer and map them back to the original shape before the token reduction. The most straightforward alternative is (1) to use the masks from the TokenLearner module to 'unpool' the output tokens: each output token is multiplied with the corresponding spatial map computed in the preceding TokenLearner module, and the results are summed to recover the original input tensor shape. Alternatively, (2) one more transformer layer can be used to increase the number of tokens back to the original count, similar to the 're-projection' used in [45]. Figure 7 shows the results with B/16: the unpooling strategy performed worse, and the re-projection strategy performed comparably to the TokenFuser but required more FLOPS.
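Alternative (1), the mask-based 'unpooling', amounts to a weighted sum of the output tokens using the TokenLearner maps; a minimal sketch under that reading (tensor names are assumptions):

```python
import tensorflow as tf

def unpool_with_tokenlearner_maps(tokens, attention_maps):
    """Alternative (1) from the text: 'unpool' the S output tokens by reusing
    the spatial maps computed by the preceding TokenLearner.
    tokens:         (B, S, C)
    attention_maps: (B, H, W, S) -- the alpha_i(X_t) maps
    """
    # Weight each token by its map and sum over tokens to recover (B, H, W, C).
    return tf.einsum('bhws,bsc->bhwc', attention_maps, tokens)
```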
In this section, we illustrate how TokenLearner works for video representations. The TokenLearner and TokenFuser modules introduced in Section 2 are directly applicable for video representation learning. The only difference between the TokenLearner used for images and the one used for videos is that TokenLearner generates multiple Z_t for videos and they need to be stacked to form Z. Once Z is generated, any standard Transformer layers could be used to parse them jointly.
This section demonstrates how TokenLearner works for video representations. The TokenLearner and TokenFuser modules introduced in Section 2 are directly applicable to video representation learning. The only difference between the TokenLearner used for images and the one used for videos is that for videos TokenLearner generates multiple Z_t, which need to be stacked to form Z. Once Z is generated, any standard Transformer layers can be used to parse them jointly.
Figure 8 provides an overview of the combined framework for videos. TokenLearner first extracts S number of tokens per frame, resulting in a total of ST tokens where T is the number of frames. Once TokenLearner generates these adaptively learned tokens, they are provided to the subsequent Transformer layer to capture the global space-time patterns. Finally (and optionally depending on the architecture), TokenFuser applies a linear layer over the token axis and then remaps the tensor shape back, as discussed in Section 2.2. Following Eq. 2, TokenFuser is applied for per-frame representation Yt. This results in a lightweight approach, which brings forth an efficient video representation by capturing long-range visual patterns.
Figure 8 gives an overview of the combined framework for videos. TokenLearner first extracts S tokens per frame, resulting in a total of ST tokens, where T is the number of frames. Once TokenLearner generates these adaptively learned tokens, they are provided to the subsequent Transformer layer to capture global space-time patterns. Finally (and optionally, depending on the architecture), TokenFuser applies a linear layer over the token axis and then remaps the tensor shape back, as discussed in Section 2.2. Following Eq. (2), TokenFuser is applied to the per-frame representation Y_t. The result is a lightweight approach that yields an efficient video representation by capturing long-range visual patterns.
What we show for the video representation learning is a combination of TokenLearner, Transformer, and TokenFuser modules repeated multiple times, as described in Figure 3 (b). The TokenFuser part is dropped if we are using the model architecture (a), and only the Transformer layers are repeated multiple times after the TokenLearner module.
The video representation learning shown here is a combination of TokenLearner, Transformer, and TokenFuser modules repeated multiple times, as described in Figure 3(b). If model architecture (a) is used instead, the TokenFuser part is dropped and only the Transformer layers are repeated multiple times after the TokenLearner module.
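A minimal sketch of the per-frame tokenization for videos described above, reusing the TokenLearner sketch from Section 2 (the function and variable names are assumptions):

```python
import tensorflow as tf

def video_tokens(video_features, num_tokens=8):
    """Sketch of Section 4: run TokenLearner on every frame and stack the
    per-frame tokens Z_t into Z with S*T tokens for the joint Transformer.
    video_features: (B, T, H, W, C)
    """
    t, h, w, c = video_features.shape[1:]
    frames = tf.reshape(video_features, [-1, h, w, c])  # fold time into the batch axis
    z_t = TokenLearner(num_tokens)(frames)              # (B*T, S, C), sketch from Section 2
    z = tf.reshape(z_t, [-1, t * num_tokens, c])        # (B, S*T, C)
    return z  # processed jointly by standard Transformer layers over space-time
```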
ViViT [2] is a direct extension of ViT for videos, which uses spatio-temporal patches from videos as its tokens. The space-time patches are typically of size 16x16x2, and they are given to the Transformer layers similar to ViT. ViViT and ViT share the architecture. For our experiments, we insert the TokenLearner module within the ViViT architecture, identically to how we inserted it within ViT in Figure 3. ViViT has been one of the state-of-the-art approaches for the Kinetics datasets [6], and the idea is to confirm whether TokenLearner could directly be added to such general video representation models and outperform them.
ViViT [2] is a direct extension of ViT to videos, using spatio-temporal patches from the video as its tokens. The space-time patches are typically of size 16x16x2 and are fed to Transformer layers just as in ViT; ViViT and ViT share the same architecture. In these experiments, the TokenLearner module is inserted into the ViViT architecture in exactly the same way as it was inserted into ViT in Figure 3. ViViT has been one of the state-of-the-art approaches on the Kinetics datasets, and the goal is to confirm whether TokenLearner can be directly added to such general video representation models and outperform them.
In this experiment, we follow the Bottleneck Transformer [35] network style, while taking advantage of X3D [13] as the backbone. This is motivated by the successful usage of X3D on Charades. Charades has longer duration videos (average of 30 seconds) with long actions (average action length of 15 seconds). This requires the model to understand longer term temporal information by considering multiple temporal tokens, and TokenLearner allows efficient computation of them over many frames. Specifically, we modified X3D to be more computationally efficient by (1) replacing its 3D XYT convolutional layers with a pair of 2D conv. layer and 1D conv. layer, and (2) removing Squeeze-and-Excitation layers [20] and swish activations. Our backbone could be viewed as X(2+1)D. We use the channel sizes and the number of layers identical to X3D-M, which is an efficient model.
This experiment follows the Bottleneck Transformer [35] network style, using X3D [13] as the backbone, motivated by the success of X3D on Charades. Charades has longer videos (30 seconds on average) with long actions (average action length of 15 seconds), which requires the model to understand longer-term temporal information by considering multiple temporal tokens; TokenLearner allows computing them efficiently over many frames. Specifically, X3D is modified to be more computationally efficient by (1) replacing its 3D XYT convolutional layers with a pair of a 2D convolutional layer and a 1D convolutional layer, and (2) removing the Squeeze-and-Excitation layers [20] and swish activations. The resulting backbone can be viewed as X(2+1)D, with channel sizes and number of layers identical to X3D-M, an efficient model.
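The (2+1)D decomposition in point (1) can be sketched as follows. Only the split of a 3D XYT convolution into a 2D spatial plus a 1D temporal convolution is taken from the text; the kernel sizes and the Keras wrapping are assumptions.

```python
import tensorflow as tf

def conv_2plus1d(channels):
    """Replace one 3D XYT convolution with a 2D spatial conv followed by a
    1D temporal conv (no Squeeze-and-Excitation, no swish), as described above.
    Input layout assumed to be (B, T, H, W, C)."""
    return tf.keras.Sequential([
        tf.keras.layers.Conv3D(channels, kernel_size=(1, 3, 3), padding='same'),  # spatial (XY)
        tf.keras.layers.Conv3D(channels, kernel_size=(3, 1, 1), padding='same'),  # temporal (T)
    ])
```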
Based on such X(2+1)D architecture, and following the Bottleneck Transformer concept, we replace the space-time convolution layers in the last block with our transformers. Figure 9 illustrates the residual module architecture, which is repeated multiple times in the block. We have tried different versions, and our final model is built by replacing 1D temporal convolution with our TokenLearner while keeping the 2D 3 × 3 convolution layer in the X(2+1)D modules. The spatial attention function (i.e., α(·)) in TokenLearner is implemented with a single conv2d layer.
Based on this X(2+1)D architecture, and following the Bottleneck Transformer concept, the space-time convolution layers in the last block are replaced with transformers. Figure 9 illustrates the residual module architecture, which is repeated multiple times within the block. Different versions were tried; the final model is built by replacing the 1D temporal convolution with TokenLearner while keeping the 2D 3 × 3 convolution layer in the X(2+1)D modules. The spatial attention function (i.e., α(·)) in this TokenLearner is implemented with a single conv2d layer.
Here, we used a Vector Transformer instead of MHSA as our Transformer layer, which could be also viewed as the MHSA with the number of heads being identical to the number of channels. We provide more details in Appendix.
Here, a Vector Transformer is used instead of MHSA as the Transformer layer; it can be viewed as MHSA with the number of heads equal to the number of channels. More details are provided in the Appendix.
We use 224 × 224 × 64 videos for training and 256 × 256 × 64 videos for testing. After the 3rd residual block, the input tensor has the shape of 8 × 8 × 64, and this becomes the input to the TokenLearner. For an efficient implementation, the intermediate channel size of TokenLearner was set identical to the output channel size, d = 432. Notice that 64 frames were used to best capture longer-term temporal information. S = 8 tokens were used.
224 × 224 × 64 videos are used for training and 256 × 256 × 64 videos for testing. After the 3rd residual block, the input tensor has shape 8 × 8 × 64, and this becomes the input to TokenLearner. For an efficient implementation, the intermediate channel size of TokenLearner is set to the output channel size, d = 432. Note that 64 frames are used to best capture longer-term temporal information, and S = 8 tokens are used.
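A quick shape walk-through of this setting, reusing the video_tokens sketch from Section 4 (the tensor values are placeholders; the shapes follow the text):

```python
import tensorflow as tf

# After the 3rd residual block: T = 64 frames, an 8 x 8 spatial grid, and
# d = 432 channels (values from the text above).
features = tf.zeros([1, 64, 8, 8, 432])

tokens = video_tokens(features, num_tokens=8)  # S = 8 tokens per frame
print(tokens.shape)                            # (1, 512, 432): 64 frames x 8 tokens each
```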