[Deep Learning Paper Notes][Attention] Spatial Transformer Networks

Jaderberg, Max, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu. “Spatial transformer networks.” Advances in Neural Information Processing Systems. 2015. (Citations: 116).


1 Motivation

Show, Attend and Tell only allows attention over a fixed grid. We want a model that can attend to an arbitrary part of the image.


The pooling operation allows a network to be somewhat spatially invariant to the position of features. However, due to the typically small spatial support for max-pooling, this spatial invariance is only realised over a deep hierarchy of max-pooling and convolutions, and the intermediate feature maps in a CNN are not actually invariant to large transformations of the input data.


Our goal is to introduce a spatial transformer module, which intelligently selects features of interest (attention) and transforms them by scaling, cropping, rotation, and non-rigid deformation.


2 Spatial Transformers
We want a differentiable module which applies a spatial transformation to a feature map during a single forward pass. For each pixel in the output feature map with coordinates $(x^t, y^t)$, we compute the corresponding sampling location $(x^s, y^s)$ in the input feature map by an affine transformation:

$$\begin{pmatrix} x^s \\ y^s \end{pmatrix} = A_\theta \begin{pmatrix} x^t \\ y^t \\ 1 \end{pmatrix} = \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix} \begin{pmatrix} x^t \\ y^t \\ 1 \end{pmatrix}$$

We normalize the coordinates $x^s, y^s$ and $x^t, y^t$ to the range $[-1, 1]$. This transformation allows cropping, translation, rotation, scaling, and skew to be applied to the input feature map.
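For instance, constraining $A_\theta$ to an isotropic scale $s$ and a translation $(t_x, t_y)$ gives the attention special case from the paper, which crops, translates, and isotropically scales a region of the input:

$$A_\theta = \begin{bmatrix} s & 0 & t_x \\ 0 & s & t_y \end{bmatrix}$$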

For multi-channel inputs, the same warping is applied to each channel. We repeat this for all pixels in the output to get a sampling grid, and then use bilinear interpolation to compute the output.
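A minimal NumPy sketch of this sampling step (the function name `affine_sample`, the per-pixel loop, and the zero-padding of out-of-range samples are illustrative choices, not code from the paper):

```python
import numpy as np

def affine_sample(U, theta, out_h, out_w):
    """Warp a single-channel feature map U (H x W) by a 2x3 affine matrix theta."""
    H, W = U.shape
    V = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Normalized target coordinates in [-1, 1].
            x_t = -1 + 2 * j / (out_w - 1)
            y_t = -1 + 2 * i / (out_h - 1)
            # Affine transform gives the source sampling location.
            x_s, y_s = theta @ np.array([x_t, y_t, 1.0])
            # Samples falling outside the input contribute zero.
            if not (-1 <= x_s <= 1 and -1 <= y_s <= 1):
                continue
            # Map normalized source coordinates back to pixel indices.
            x = (x_s + 1) * (W - 1) / 2
            y = (y_s + 1) * (H - 1) / 2
            x0, y0 = int(np.floor(x)), int(np.floor(y))
            x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
            wx, wy = x - x0, y - y0
            # Bilinear interpolation over the four nearest input pixels.
            V[i, j] = (U[y0, x0] * (1 - wx) * (1 - wy)
                       + U[y0, x1] * wx * (1 - wy)
                       + U[y1, x0] * (1 - wx) * wy
                       + U[y1, x1] * wx * wy)
    return V

# The identity transform reproduces the input exactly.
U = np.arange(16.0).reshape(4, 4)
theta_id = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])
assert np.allclose(affine_sample(U, theta_id, 4, 4), U)
```

Because the bilinear weights are (sub-)differentiable in both the sampling coordinates and the input values, gradients flow back through this sampler to the transformation parameters $\theta$.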


3 Architecture
See Fig. 1. One can also use multiple spatial transformers in parallel; this can be useful if there are multiple objects or parts of interest in a feature map that should be focussed on individually. A limitation of this architecture in a purely feed-forward network is that the number of parallel spatial transformers limits the number of objects that the network can model.

[Figure 1]
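To make the data flow concrete, here is a minimal PyTorch sketch of a single spatial transformer module. The layer sizes and the 1×28×28 input are my own illustrative assumptions; `F.affine_grid` and `F.grid_sample` play the roles of the grid generator and bilinear sampler described above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class STN(nn.Module):
    """Localization net -> grid generator -> bilinear sampler."""

    def __init__(self):
        super().__init__()
        # Localization network: regresses the 6 affine parameters theta
        # from the input feature map (assumed 1 x 28 x 28 here).
        self.loc = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=7), nn.MaxPool2d(2), nn.ReLU(),
            nn.Conv2d(8, 10, kernel_size=5), nn.MaxPool2d(2), nn.ReLU(),
        )
        self.fc_loc = nn.Sequential(
            nn.Linear(10 * 3 * 3, 32), nn.ReLU(), nn.Linear(32, 6),
        )
        # NOTE: fc_loc's final layer should be initialized to the identity
        # transform; see Training Details below.

    def forward(self, x):
        theta = self.fc_loc(self.loc(x).flatten(1)).view(-1, 2, 3)
        # Grid generator: one source sampling coordinate per output pixel.
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        # Bilinear sampler: differentiable, so gradients reach theta.
        return F.grid_sample(x, grid, align_corners=False)

x = torch.randn(8, 1, 28, 28)
y = STN()(x)  # same shape as x, warped per sample
```

Several such modules can be run in parallel on the same feature map, each producing its own warped output for a separate downstream branch.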


4 Training Details
For training, we initialize the transformation parameters to the identity transform, $A_\theta = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}$.

This makes the initial output the same as the input, so the network starts from an undistorted view and gradually learns useful transformations.
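A common way to implement this (a minimal sketch; the layer shape and variable names are mine, assuming the localization network ends in a fully connected layer that regresses the six affine parameters):

```python
import torch
import torch.nn as nn

# Hypothetical final layer of a localization network, regressing theta.
fc_theta = nn.Linear(32, 6)

# Zero weights: at initialization, theta does not depend on the input.
nn.init.zeros_(fc_theta.weight)
# Bias = flattened identity matrix [[1, 0, 0], [0, 1, 0]],
# so the sampled output equals the input at the start of training.
with torch.no_grad():
    fc_theta.bias.copy_(torch.tensor([1.0, 0.0, 0.0, 0.0, 1.0, 0.0]))
```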

5 Results
See Fig. 2. When we insert spatial transformers into a classification network, they learn to attend to and transform the part of the input that matters for the task.



[Figure 2]
