GitHub - lucidrains/vit-pytorch: Implementation of Vision Transformer, a simple way to achieve SOTA in vision classification with only a single transformer encoder, in Pytorch
An implementation of Vision Transformer, which achieves SOTA in vision classification with only a single transformer encoder.
There is not much code involved; using it as a base for experiments should help speed up the attention revolution. (It feels a bit like an integrated toolkit.)
For experiments based on pretrained models, see here!
pip install vit-pytorch
import torch
from vit_pytorch import ViT
v = ViT(
    image_size = 256,
    patch_size = 32,
    num_classes = 1000,
    dim = 1024,
    depth = 6,
    heads = 16,
    mlp_dim = 2048,
    dropout = 0.1,
    emb_dropout = 0.1
)
img = torch.randn(1, 3, 256, 256)
preds = v(img) # (1, 1000)
image_size: int. Image size. If the image is rectangular, make sure image_size is the maximum of the width and height.
patch_size: int. Size of each patch. The number of patches is n = (image_size // patch_size) ** 2, and n must be greater than 16; image_size must be divisible by patch_size (see the sanity-check sketch after this list).
num_classes: int. Number of classes to classify into. (Note to self: pay attention to this parameter.)
dim: int. Last dimension of the output tensor after the linear transformation nn.Linear(..., dim).
depth: int. Number of Transformer blocks. (Q: what exactly is a Transformer block?)
heads: int. Number of heads in the multi-head attention layers.
mlp_dim: int. Dimension of the MLP (feed-forward) layer.
channels: int, default 3 (RGB). Number of image channels.
dropout: float in [0, 1], default 0. Dropout rate.
emb_dropout: float in [0, 1], default 0. Embedding dropout rate.
pool: string, either cls token pooling or mean pooling.
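As a quick sanity check on the patch constraints above (a minimal sketch; the concrete numbers simply mirror the example above):

image_size, patch_size = 256, 32
assert image_size % patch_size == 0, 'image_size must be divisible by patch_size'
num_patches = (image_size // patch_size) ** 2   # 8 * 8 = 64 patches
assert num_patches > 16, 'the number of patches must be greater than 16'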
SimpleViT uses 2D sin-cos (sinusoidal) positional embeddings, global average pooling (no CLS token), no dropout, a batch size of 1024 instead of 4096, and RandAugment and MixUp augmentations. They also show that using a simple linear layer at the end is not significantly worse than the original MLP head.
Paper
import torch
from vit_pytorch import SimpleViT
v = SimpleViT(
    image_size = 256,
    patch_size = 32,
    num_classes = 1000,
    dim = 1024,
    depth = 6,
    heads = 16,
    mlp_dim = 2048
)
img = torch.randn(1, 3, 256, 256)
preds = v(img) # (1, 1000)
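For reference, the 2D sin-cos positional embedding mentioned above can be sketched as follows (a minimal sketch that assumes dim is divisible by 4; it mirrors the idea rather than the repository's exact code):

import torch

def posemb_sincos_2d(h, w, dim, temperature = 10000):
    # one sin/cos pair per axis, hence dim must be divisible by 4
    y, x = torch.meshgrid(torch.arange(h), torch.arange(w), indexing = 'ij')
    omega = torch.arange(dim // 4) / (dim // 4 - 1)
    omega = 1.0 / (temperature ** omega)
    y = y.flatten()[:, None] * omega[None, :]
    x = x.flatten()[:, None] * omega[None, :]
    return torch.cat((x.sin(), x.cos(), y.sin(), y.cos()), dim = 1)   # (h * w, dim)

pe = posemb_sincos_2d(8, 8, 1024)   # one embedding per patch for a 256 / 32 = 8 x 8 grid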
Distilling knowledge from a convolutional network into a vision transformer via a distillation token can produce small and efficient vision transformers. This repository provides an easy way to perform the distillation.
For example, distilling from ResNet50 (or any teacher) into a vision transformer:
import torch
from torchvision.models import resnet50
from vit_pytorch.distill import DistillableViT, DistillWrapper
teacher = resnet50(pretrained = True)
v = DistillableViT(
    image_size = 256,
    patch_size = 32,
    num_classes = 1000,
    dim = 1024,
    depth = 6,
    heads = 8,
    mlp_dim = 2048,
    dropout = 0.1,
    emb_dropout = 0.1
)
distiller = DistillWrapper(
    student = v,
    teacher = teacher,
    temperature = 3,    # temperature of distillation
    alpha = 0.5,        # trade-off between main loss and distillation loss
    hard = False        # whether to use soft or hard distillation
)
img = torch.randn(2, 3, 256, 256)
labels = torch.randint(0, 1000, (2,))
loss = distiller(img, labels)
loss.backward()
# after lots of training above ...
pred = v(img) # (2, 1000)
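For intuition, a soft-distillation loss of this kind blends ordinary cross entropy with a temperature-softened KL term; a sketch of one common formulation (the exact weighting convention inside DistillWrapper may differ):

import torch.nn.functional as F

def soft_distill_loss(student_logits, teacher_logits, labels, T = 3.0, alpha = 0.5):
    ce = F.cross_entropy(student_logits, labels)               # loss against ground-truth labels
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim = -1),
        F.softmax(teacher_logits / T, dim = -1),
        reduction = 'batchmean'
    ) * T * T                                                  # rescale gradients softened by the temperature
    return alpha * ce + (1 - alpha) * kd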
Apart from how the forward pass is handled, the DistillableViT class is identical to ViT, so after distillation training you can load its parameters back into a ViT.
You can also use the convenient .to_vit method on a DistillableViT instance to get back a ViT instance.
v = v.to_vit()
type(v) # ViT
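Since the parameter names line up with a plain ViT, you could also just copy the state dict into a freshly constructed ViT after training (a sketch, assuming the ViT is built with the same hyperparameters as the DistillableViT above):

plain_vit = ViT(
    image_size = 256, patch_size = 32, num_classes = 1000,
    dim = 1024, depth = 6, heads = 8, mlp_dim = 2048,
    dropout = 0.1, emb_dropout = 0.1
)
plain_vit.load_state_dict(v.state_dict())   # the distilled student's weights carry over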
This paper notes that ViT struggles to attend at greater depths (beyond 12 layers) and proposes mixing the attention of each head post-softmax, dubbed Re-attention, as a solution. The results are consistent with the Talking Heads paper from NLP.
import torch
from vit_pytorch.deepvit import DeepViT
v = DeepViT(
    image_size = 256,
    patch_size = 32,
    num_classes = 1000,
    dim = 1024,
    depth = 6,
    heads = 16,
    mlp_dim = 2048,
    dropout = 0.1,
    emb_dropout = 0.1
)
img = torch.randn(1, 3, 256, 256)
preds = v(img) # (1, 1000)
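For intuition, Re-attention mixes the per-head attention maps after the softmax with a small learned head-to-head matrix; a minimal sketch (names and shapes are illustrative, and the repository's version also normalizes the mixed maps):

import torch
from torch import nn

heads = 16
reattn_weights = nn.Parameter(torch.randn(heads, heads))     # learned head-mixing matrix

attn = torch.rand(1, heads, 65, 65).softmax(dim = -1)        # (batch, heads, tokens, tokens)
reattn = torch.einsum('b h i j, h g -> b g i j', attn, reattn_weights)   # mix attention across heads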
This paper also points out the difficulty of training vision transformers at greater depths and proposes two solutions. First, it applies a per-channel multiplication to the output of each residual block. Second, it has the patches attend only to one another, and allows the CLS token to attend to the patches only in the last few layers. They also add Talking Heads, noting improvements.
import torch
from vit_pytorch.cait import CaiT
v = CaiT(
    image_size = 256,
    patch_size = 32,
    num_classes = 1000,
    dim = 1024,
    depth = 12,             # depth of transformer for patch to patch attention only
    cls_depth = 2,          # depth of cross attention of CLS tokens to patch
    heads = 16,
    mlp_dim = 2048,
    dropout = 0.1,
    emb_dropout = 0.1,
    layer_dropout = 0.05    # randomly dropout 5% of the layers
)
img = torch.randn(1, 3, 256, 256)
preds = v(img) # (1, 1000)
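The per-channel multiplication on the residual branch (LayerScale in the paper) can be sketched as follows (illustrative only; the initial scale value here is an assumption):

import torch
from torch import nn

dim = 1024
gamma = nn.Parameter(torch.full((dim,), 1e-4))   # learned per-channel scale, initialized small

def residual(x, sublayer):
    # sublayer is the attention or feed-forward block
    return x + gamma * sublayer(x)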
This paper proposes that the first couple of layers downsample the image sequence by unfolding, so that each token contains overlapping image data, as shown in the figure.
import torch
from vit_pytorch.t2t import T2TViT
v = T2TViT(
    dim = 512,
    image_size = 224,
    depth = 5,
    heads = 8,
    mlp_dim = 512,
    num_classes = 1000,
    t2t_layers = ((7, 4), (3, 2), (3, 2))    # tuples of the kernel size and stride of each consecutive layer of the initial token-to-token module
)
img = torch.randn(1, 3, 224, 224)
preds = v(img) # (1, 1000)
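The overlapping unfold step can be pictured with torch.nn.Unfold (a rough sketch of the first token-to-token stage with kernel 7 and stride 4; the padding is an assumption and the repository's layer differs in details):

import torch
from torch import nn

img = torch.randn(1, 3, 224, 224)

unfold = nn.Unfold(kernel_size = 7, stride = 4, padding = 3)   # stride < kernel, so neighbouring tokens overlap
tokens = unfold(img)              # (1, 3 * 7 * 7, 56 * 56) = (1, 147, 3136)
tokens = tokens.transpose(1, 2)   # (1, 3136, 147): one overlapping token per spatial location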
CCT proposes compact transformers that use convolutions instead of patching and perform sequence pooling. This lets CCT reach high accuracy with a low number of parameters.
import torch
from vit_pytorch.cct import CCT
cct = CCT(
    img_size = (224, 448),
    embedding_dim = 384,
    n_conv_layers = 2,
    kernel_size = 7,
    stride = 2,
    padding = 3,
    pooling_kernel_size = 3,
    pooling_stride = 2,
    pooling_padding = 1,
    num_layers = 14,
    num_heads = 6,
    mlp_ratio = 3.,
    num_classes = 1000,
    positional_embedding = 'learnable', # ['sine', 'learnable', 'none']
)
img = torch.randn(1, 3, 224, 448)
pred = cct(img) # (1, 1000)
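Sequence pooling replaces the CLS token with an attention-weighted average of the output tokens; roughly (a sketch, not the exact CCT implementation):

import torch
from torch import nn

dim = 384
attn_pool = nn.Linear(dim, 1)                    # one importance score per token

x = torch.randn(1, 196, dim)                     # (batch, tokens, dim) transformer output
weights = attn_pool(x).softmax(dim = 1)          # (batch, tokens, 1)
pooled = (weights * x).sum(dim = 1)              # (batch, dim), fed to the classifier head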
Alternatively, you can use one of several predefined models [2, 4, 6, 7, 8, 14, 16], which fix the number of layers, the number of attention heads, the MLP ratio, and the embedding dimension.
import torch
from vit_pytorch.cct import cct_14
cct = cct_14(
    img_size = 224,
    n_conv_layers = 1,
    kernel_size = 7,
    stride = 2,
    padding = 3,
    pooling_kernel_size = 3,
    pooling_stride = 2,
    pooling_padding = 1,
    num_classes = 1000,
    positional_embedding = 'learnable', # ['sine', 'learnable', 'none']
)
This paper proposes processing the image at two different scales with two vision transformers that cross-attend to each other every so often. They show improvements over the base vision transformer.
import torch
from vit_pytorch.cross_vit import CrossViT
v = CrossViT(
    image_size = 256,
    num_classes = 1000,
    depth = 4,               # number of multi-scale encoding blocks
    sm_dim = 192,            # high res dimension
    sm_patch_size = 16,      # high res patch size (should be smaller than lg_patch_size)
    sm_enc_depth = 2,        # high res depth
    sm_enc_heads = 8,        # high res heads
    sm_enc_mlp_dim = 2048,   # high res feedforward dimension
    lg_dim = 384,            # low res dimension
    lg_patch_size = 64,      # low res patch size
    lg_enc_depth = 3,        # low res depth
    lg_enc_heads = 8,        # low res heads
    lg_enc_mlp_dim = 2048,   # low res feedforward dimension
    cross_attn_depth = 2,    # cross attention rounds
    cross_attn_heads = 8,    # cross attention heads
    dropout = 0.1,
    emb_dropout = 0.1
)
img = torch.randn(1, 3, 256, 256)
pred = v(img) # (1, 1000)
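The cross attention between the two branches roughly works by projecting one branch's CLS token into the other branch's dimension, letting it attend over that branch's patch tokens, and projecting it back; a sketch (names and module choices here are illustrative, not the repository's exact code):

import torch
from torch import nn

sm_dim, lg_dim, heads = 192, 384, 8

to_lg = nn.Linear(sm_dim, lg_dim)
to_sm = nn.Linear(lg_dim, sm_dim)
cross_attn = nn.MultiheadAttention(lg_dim, heads, batch_first = True)

sm_cls = torch.randn(1, 1, sm_dim)        # CLS token of the small-patch (high res) branch
lg_patches = torch.randn(1, 16, lg_dim)   # patch tokens of the large-patch (low res) branch

q = to_lg(sm_cls)                                  # project CLS into the other branch's dimension
fused, _ = cross_attn(q, lg_patches, lg_patches)   # CLS attends over the other branch's patches
sm_cls = sm_cls + to_sm(fused)                     # bring the fused information back to its own branch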
This paper proposes downsampling the tokens through a pooling procedure that uses depth-wise convolutions.
import torch
from vit_pytorch.pit import PiT
v = PiT(
    image_size = 224,
    patch_size = 14,
    dim = 256,
    num_classes = 1000,
    depth = (3, 3, 3),    # list of depths, indicating the number of rounds of each stage before a downsample
    heads = 16,
    mlp_dim = 2048,
    dropout = 0.1,
    emb_dropout = 0.1
)
# forward pass
img = torch.randn(1, 3, 224, 224)
preds = v(img) # (1, 1000)
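The pooling stage described above reshapes the tokens back into a 2D grid and downsamples them with a strided depth-wise convolution; a sketch (parameter choices here are assumptions, and the repository also handles the CLS token separately):

import torch
from torch import nn
from einops import rearrange

dim = 256
pool = nn.Conv2d(dim, dim * 2, kernel_size = 3, stride = 2, padding = 1, groups = dim)   # depth-wise, stride 2

tokens = torch.randn(1, 16 * 16, dim)                      # (batch, tokens, dim), a 16 x 16 token grid
grid = rearrange(tokens, 'b (h w) d -> b d h w', h = 16)   # back to a 2D feature map
grid = pool(grid)                                          # (1, 512, 8, 8): fewer tokens, wider dim
tokens = rearrange(grid, 'b d h w -> b (h w) d')           # (1, 64, 512)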