paper
code
Model architecture for our Focal Transformers. As highlighted in light blue boxes, our main innovation is the proposed focal self-attention mechanism in each Transformer layer.
Left: Visualization of the attention maps of the three heads at the given query patch (blue) in the first layer of the DeiT-Tiny model. Right: An illustrative depiction of the focal self-attention mechanism. Three granularity levels are used to compose the attention region for the blue query.
FSA attends to fine-grained tokens only locally, instead of attending to all tokens at the finest grain
it covers as many regions as standard self-attention but at much lower cost
The size of the receptive field (y-axis) as the number of attended tokens increases (x-axis), for standard and our focal self-attention. For focal self-attention, we assume the window granularity increases gradually by a factor of 2, up to at most 8. Note that the y-axis is logarithmic.
for a query position, by using gradually coarser grains for its far surroundings, FSA attains a significantly larger receptive field while attending the same number of visual tokens as the baseline
the focal mechanism enables long-range self-attention at much lower time and memory cost
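As a rough illustration of the plot described above, the sketch below counts how many original positions a fixed token budget covers when the attended tokens are spread over coarser granularities; the even split across levels and the granularity schedule (1, 2, 4, 8) are assumptions for illustration, not the paper's exact accounting.

```python
# Rough accounting behind the receptive-field plot. Assumptions for illustration:
# the token budget is split evenly across levels and granularities double up to 8.

def coverage_standard(num_tokens: int) -> int:
    # Standard self-attention: each attended token is one original position.
    return num_tokens

def coverage_focal(num_tokens: int, granularities=(1, 2, 4, 8)) -> int:
    # Focal self-attention: a token at granularity g summarizes a g x g region,
    # so the same budget covers far more positions at the coarser levels.
    per_level = num_tokens // len(granularities)
    return sum(per_level * g * g for g in granularities)

for budget in (64, 256, 1024):
    print(budget, coverage_standard(budget), coverage_focal(budget))
```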
An illustration of our focal self-attention at the window level. Each of the finest square cells represents a visual token, either from the original feature map or from the squeezed ones. Suppose we have an input feature map of size 20x20. We first partition it into 5x5 windows of size 4x4. Taking the 4x4 blue window in the middle as the query, we extract its surrounding tokens at multiple granularity levels as its keys and values. For the first level, we extract the 8x8 tokens closest to the blue window at the finest grain. At the second level, we expand the attention region and pool the surrounding 2x2 sub-windows, which results in 6x6 pooled tokens. At the third level, we attend to an even larger region covering the whole feature map and pool 4x4 sub-windows. Finally, these three levels of tokens are concatenated to compute the keys and values for the 4x4=16 tokens (queries) in the blue window.
first, define three terms for clarity: the number of focal levels $L$, the focal window size $s_w^l$ (size of the sub-window pooled at level $l$), and the focal region size $s_r^l$ (number of attended sub-windows horizontally and vertically at level $l$)
focal self-attention then proceeds in two main steps: sub-window pooling, followed by attention computation over the pooled keys and values (a shape-level sketch follows below)
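Below is a shape-level sketch of these two steps on the figure's example (a 20x20 map, a 4x4 query window, three levels). Average pooling stands in for the paper's learnable sub-window pooling, each level attends to all of its pooled tokens rather than only the local 8x8 / 6x6 / 5x5 regions, and multi-head projections and relative position bias are omitted; it illustrates the idea, not the released implementation.

```python
import torch
import torch.nn.functional as F

# Shape-level sketch of focal self-attention for one query window, using the
# figure's numbers (20x20 map, 4x4 query window, three granularity levels).
B, C, H, W = 1, 96, 20, 20
x = torch.randn(B, C, H, W)

# Level 1: raw tokens; level 2: pool 2x2 sub-windows; level 3: pool 4x4 sub-windows.
# (Average pooling is an assumption; the paper uses a learnable sub-window pooling.)
pooled = [x, F.avg_pool2d(x, 2), F.avg_pool2d(x, 4)]

# Queries: the 16 tokens of the central 4x4 window.
q = x[:, :, 8:12, 8:12].flatten(2).transpose(1, 2)                     # (B, 16, C)

# Keys/values: concatenate tokens from all three levels. The real method keeps
# only a local region per level (8x8 / 6x6 / 5x5 around the window).
kv = torch.cat([p.flatten(2).transpose(1, 2) for p in pooled], dim=1)  # (B, 525, C)

attn = torch.softmax(q @ kv.transpose(1, 2) / C ** 0.5, dim=-1)        # (B, 16, 525)
out = attn @ kv                                                        # (B, 16, C)
print(out.shape)  # torch.Size([1, 16, 96])
```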
with encoder blocks containing FSA, the Transformer encoder is computed as
$$\begin{aligned} \widehat{z}_l &= FSA(LN(z_{l-1})) + z_{l-1} \\ z_l &= FFN(LN(\widehat{z}_l)) + \widehat{z}_l \end{aligned}$$
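This pre-norm residual block maps directly onto a standard PyTorch module; in the sketch below, nn.MultiheadAttention is only a placeholder with the same (B, N, C) interface as FSA, since the full focal attention lives in the released code.

```python
import torch
import torch.nn as nn

class FocalBlock(nn.Module):
    """Pre-norm block: z_hat = FSA(LN(z)) + z; z = FFN(LN(z_hat)) + z_hat.
    The attention here is a stand-in for focal self-attention (see released code)."""

    def __init__(self, dim: int, num_heads: int = 4, mlp_ratio: float = 4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Placeholder attention with the same (B, N, C) interface as FSA.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        y = self.norm1(z)
        z_hat = self.attn(y, y, y, need_weights=False)[0] + z
        return self.ffn(self.norm2(z_hat)) + z_hat
```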
in ViT, given an input feature map $x\in R^{h\times w\times C}$, the FLOPs of MSA are
$$\Omega(MSA)=4hwC^2+2(hw)^2C$$
given an input feature map $x\in R^{h\times w\times C}$, there are $\frac{h}{s_p}\times \frac{w}{s_p}$ sub-windows at focal level $l$
for pooling on each $s_w^l\times s_w^l$-size sub-window
$$\Omega(pool)=(s_w^l)^2C$$
for the aggregation of sub-windows over the $h\times w$ feature map at each focal level
$$\Omega(aggr)=hwC$$
attention cost for an $s_p\times s_p$-size query window
$$\Omega(attn_{win})=(s_p)^2C\sum_{l}(s_r^l)^2$$
attention cost over the whole feature map
$$\Omega(attn_{feat})=hwC\sum_{l}(s_r^l)^2$$
summing up, the total cost of FSA is
$$\Omega(FSA)=n_{levels}\times\Omega(aggr)+\Omega(attn_{feat})=hwC(L+\sum_{l}(s_r^l)^2)$$
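The two complexity formulas can be sanity-checked with a few lines of arithmetic; the numbers below (a 56x56 stage-1 map with C=96 and assumed focal region sizes s_r=(13, 7)) are illustrative placeholders rather than the paper's exact configuration.

```python
# Plug-in check of the two complexity formulas (illustrative numbers only).

def flops_msa(h: int, w: int, C: int) -> int:
    # Omega(MSA) = 4 h w C^2 + 2 (h w)^2 C
    return 4 * h * w * C**2 + 2 * (h * w) ** 2 * C

def flops_fsa(h: int, w: int, C: int, s_r: tuple) -> int:
    # Omega(FSA) = h w C (L + sum_l (s_r^l)^2), with L = number of focal levels
    L = len(s_r)
    return h * w * C * (L + sum(r * r for r in s_r))

h = w = 56   # high-resolution early stage, where the quadratic MSA term dominates
C = 96
print(f"MSA: {flops_msa(h, w, C) / 1e9:.2f} GFLOPs")           # ~2.00 GFLOPs
print(f"FSA: {flops_fsa(h, w, C, s_r=(13, 7)) / 1e9:.2f} GFLOPs")  # ~0.07 GFLOPs
```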
Model configurations for our focal Transformers. We introduce three configurations Focal-Tiny, Focal-Small and Focal-Base with different model capacities.
dataset ImageNet-1K, with the same augmentation and regularization as DeiT
optimizer AdamW: batch size=1024, 300 epochs, init lr=1e-3, weight decay=0.05, linear warm-up 20 epochs, cosine decay
stochastic depth 0.2, 0.2, 0.3 for Focal-T, Focal-S, Focal-B
max gradient norm clipped to 5.0
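This ImageNet-1K recipe can be approximated in plain PyTorch as below; the per-step schedule is an assumed reconstruction (the released code uses timm utilities), but the hyperparameters follow the list above: AdamW, lr 1e-3, weight decay 0.05, 20-epoch linear warm-up, cosine decay over 300 epochs, and gradient norm clipping at 5.0.

```python
import math
import torch

# Assumed reconstruction of the recipe above with plain PyTorch;
# the scheduler is stepped once per training iteration.

def build_optimizer_and_scheduler(model, steps_per_epoch: int,
                                  epochs: int = 300, warmup_epochs: int = 20,
                                  base_lr: float = 1e-3, weight_decay: float = 0.05):
    optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr,
                                  weight_decay=weight_decay)
    warmup_steps = warmup_epochs * steps_per_epoch
    total_steps = epochs * steps_per_epoch

    def lr_lambda(step: int) -> float:
        if step < warmup_steps:                                  # linear warm-up
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))        # cosine decay

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler

# Inside the training loop, gradients are clipped before the optimizer step:
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
```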
Comparison of image classification on ImageNet-1K for different models. Except for ViT-Base/16, all other models are trained and evaluated on 224x224 resolution.
framework Mask R-CNN, Cascade Mask R-CNN
dataset COCO 2017
optimizer AdamW: 12 or 36 epochs, init lr=1e-4, weight decay=0.05
stochastic depth 0.2, 0.2, 0.3 for Focal-T, Focal-S, Focal-B
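Stochastic depth (the 0.2/0.2/0.3 rates listed for Focal-T/S/B) randomly drops a block's residual branch per sample during training; a minimal sketch is given below, while the released code relies on timm's DropPath.

```python
import torch
import torch.nn as nn

class DropPath(nn.Module):
    """Per-sample stochastic depth: drop the residual branch with prob `p`
    during training and rescale the kept samples (minimal sketch)."""

    def __init__(self, p: float = 0.0):
        super().__init__()
        self.p = p

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.p == 0.0 or not self.training:
            return x
        keep = 1.0 - self.p
        # One Bernoulli mask per sample, broadcast over all remaining dims.
        shape = (x.shape[0],) + (1,) * (x.dim() - 1)
        mask = x.new_empty(shape).bernoulli_(keep)
        return x * mask / keep

# Applied on the residual branch, e.g.: z = z + drop_path(attn(norm(z)))
```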
Comparisons with CNN and Transformer baselines and SoTA methods on COCO object detection. The box mAP ( A P b AP^b APb) and mask mAP ( A P m AP^m APm) are reported for RetinaNet and Mask R-CNN trained with 1x schedule.
COCO object detection and segmentation results with RetinaNet and Mask R-CNN. All models are trained with 3x schedule and multi-scale inputs (MS). The numbers before and after “/” in columns 2 and 3 are the model size and complexity for RetinaNet and Mask R-CNN, respectively.
dataset COCO 2017
optimizer AdamW: 36 epochs, init lr=1e-4, weight decay=0.05
stochastic depth 0.2, 0.2, 0.3 for Focal-T, Focal-S, Focal-B
Comparison with ResNet-50 and Swin-Tiny across different object detection methods. We use Focal-Tiny as the backbone and train all models using the 3x schedule.
dataset ADE20K
optimizer AdamW: batch size=16, 160K iterations, init lr=6e-5, weight decay=0.01, polynomial decay
scaling ratios [0.5, 0.75, 1.0, 1.25, 1.5, 1.75] for multi-scale evaluation
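For the 160K-iteration polynomial decay, a common form is lr = base_lr * (1 - iter/max_iter)^power; the sketch below assumes power=0.9 as in typical segmentation recipes, which is not stated in these notes.

```python
import torch

# Polynomial LR decay for the ADE20K schedule. power=0.9 is an assumed value
# from common segmentation recipes; the notes only say "polynomial decay".

def poly_scheduler(optimizer, max_iters: int = 160_000, power: float = 0.9):
    return torch.optim.lr_scheduler.LambdaLR(
        optimizer, lambda it: (1 - it / max_iters) ** power)

# Example: AdamW with the listed hyperparameters, scheduler stepped per iteration.
params = [torch.nn.Parameter(torch.zeros(1))]
opt = torch.optim.AdamW(params, lr=6e-5, weight_decay=0.01)
sched = poly_scheduler(opt)
```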
Comparison with SoTA methods for semantic segmentation on ADE20K val set. Both single- and multi-scale evaluations are reported at the last two columns. “\neq” means pretrained on ImageNet-22K.
window size
one question is whether further increasing the window size helps model learning, given the enlarged receptive fields
Impact of different window sizes (WSize). We change the default window size from 7 to 14 and observe consistent improvements for both methods.
necessity of window shift
window shift operations enable cross-window interactions between two successive layers
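In Swin, the shift is a cyclic roll of the feature map before window partitioning, so border tokens land in different windows in alternating layers; a minimal sketch of the shift itself (the attention mask for wrapped-around regions is omitted):

```python
import torch

# Cyclic shift used by Swin-style window shifting (minimal sketch).
B, H, W, C = 1, 8, 8, 96
window = 4
shift = window // 2                      # shift by half the window size

x = torch.randn(B, H, W, C)
x_shifted = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))
# ... window partition + attention on x_shifted ...
x_back = torch.roll(x_shifted, shifts=(shift, shift), dims=(1, 2))
assert torch.equal(x_back, x)            # the shift is exactly reversible
```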
Impact of window shift (W-Shift) on Swin Transformer and Focal Transformer. Tiny models are used.
short- and long-range interactions
ablate the Focal-Tiny model by adding local, global, and both interactions, respectively
Ablating Focal-Tiny model by adding local, global and both interactions, respectively. Blue bars are for image classification and orange bars indicate object detection performance. Both local and global interactions are essential to obtain good performance.
model depth
since focal attention prompts local and global interactions at each Transformer layer, one question is whether fewer layers are needed to obtain modeling capacity similar to models without global interactions
reduce the number of Transformer layers at stage 3 of Swin-Tiny and Focal-Tiny from 6 to 4 and then to 2
Impact of the change of model depth. We gradually reduce the number of Transformer layers at the third stage from the original 6 to 4 and further to 2. This apparently hurts the performance, but our Focal Transformer has a much slower drop rate than Swin Transformer.