Vision Transformers (ViT), introduced by Dosovitskiy et al. in 2020, have steadily come to dominate computer vision, delivering strong performance on image classification as well as downstream tasks such as object detection and semantic segmentation, and kicking off the wave of Transformer models in CV. This post shows how to implement a ViT model step by step from scratch with the TensorFlow framework.
In the previous post we implemented ViT in PyTorch (click here if you're interested). By popular demand, today's hands-on is building my first ViT from scratch, this time in TensorFlow. If you are not yet familiar with the Transformer model used in natural language processing (NLP), the use of Transformers in CV may leave you puzzled, and how ViT operates on images may seem unclear. Don't worry, start here!
Since the previous post already covered the ViT model itself, here we go straight to building the architecture. The previous post used the MNIST dataset; to make things a bit richer, this time we use the CIFAR-100 dataset. The task is simple, but this image-classification exercise lets us trace the entire pipeline of the ViT model.
The environment used here:
python==3.7
tensorflow==2.7.0
tensorflow_addons==0.16.1
First, import the modules we need:
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import tensorflow_addons as tfa  # community-maintained TensorFlow add-ons (provides the AdamW optimizer)
We import the tensorflow_addons module in order to use the AdamW optimizer [1]; of course, Adam or another optimizer would work here too.
Next we create the main function. Since TensorFlow 2 ships with Keras built in, we load the dataset directly through the keras.datasets module, split it into train and test sets, preprocess CIFAR-100, and define hyperparameters such as the learning rate, batch_size, and epochs.
Via model.compile we set the AdamW optimizer, the loss, and the evaluation metrics; keras.callbacks.ModelCheckpoint saves the weight files during training; model.fit runs the training for 100 epochs; afterwards we compute accuracy on the test set.
def main():
    # Download the dataset
    num_classes = 100
    input_shape = (32, 32, 3)
    (x_train, y_train), (x_test, y_test) = keras.datasets.cifar100.load_data()
    print(f"x_train shape: {x_train.shape} - y_train shape: {y_train.shape}")
    print(f"x_test shape: {x_test.shape} - y_test shape: {y_test.shape}")

    # Hyperparameters
    learning_rate = 0.001
    weight_decay = 0.0001
    batch_size = 256
    num_epochs = 100

    def create_vit_classifier():
        # TODO: the ViT model, built up step by step in the rest of the post
        pass

    def run_experiment(model):
        # Define the optimizer
        optimizer = tfa.optimizers.AdamW(
            learning_rate=learning_rate, weight_decay=weight_decay
        )
        # Define optimizer, loss and metrics
        model.compile(
            optimizer=optimizer,
            loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
            metrics=[
                keras.metrics.SparseCategoricalAccuracy(name='accuracy'),
                keras.metrics.SparseTopKCategoricalAccuracy(5, name='top-5 accuracy'),
            ],
        )
        # Path for saved checkpoints
        checkpoint_filepath = './tmp/checkpoint'
        # Save weights during training, keeping only the best model as measured by val_accuracy
        checkpoint_callback = keras.callbacks.ModelCheckpoint(
            checkpoint_filepath,
            monitor="val_accuracy",
            save_best_only=True,
            save_weights_only=True
        )
        # Training
        history = model.fit(
            x=x_train,
            y=y_train,
            batch_size=batch_size,
            epochs=num_epochs,
            validation_split=0.1,
            callbacks=[checkpoint_callback]
        )
        # Evaluation on the test set
        model.load_weights(checkpoint_filepath)
        _, accuracy, top5_accuracy = model.evaluate(x_test, y_test)
        print(f"Test Accuracy: {accuracy}")
        print(f"Test Top-5 Accuracy: {top5_accuracy}")
        return history

    model = create_vit_classifier()
    run_experiment(model)

if __name__ == "__main__":
    main()
With the whole training/testing scaffold in place, we can now focus on the create_vit_classifier function, i.e. building the ViT model itself, whose task is to classify CIFAR-100 images.
Because TensorFlow, like most DL frameworks, provides automatic differentiation, we only need to implement the ViT network layers as subclasses of Keras's layers class and plug an optimizer into the training scaffold; the framework takes care of back-propagating gradients and updating the model's parameters, leaving us free to concentrate on the forward pass of the ViT model. The previous post already introduced the ViT model, so here we just reproduce the main network diagram.
On the CIFAR-100 dataset, we use the Keras layers module for data augmentation:
image_size = 72
# Data augmentation
data_augmentation = keras.Sequential(
    [
        layers.experimental.preprocessing.Normalization(),
        layers.experimental.preprocessing.Resizing(image_size, image_size),
        layers.experimental.preprocessing.RandomFlip("horizontal"),
        layers.experimental.preprocessing.RandomRotation(factor=0.02),
        layers.experimental.preprocessing.RandomZoom(height_factor=0.2, width_factor=0.2),
    ],
    name='data_augmentation',
)
# Normalizing based on training data
data_augmentation.layers[0].adapt(x_train)
def create_vit_classifier():
    # Creating classifier
    inputs = layers.Input(shape=input_shape)
    augmented = data_augmentation(inputs)
ViT first decomposes an image into multiple sub-images (patches) and maps each patch to a vector. After resizing the image to 72x72, we split it into a 12x12 grid of patches, each patch being 6x6 (if the image size is not evenly divisible into patches, the image must be padded), so from a single image we obtain 144 patches. The image is reshaped into:
(N, H/P x W/P, P x P x C) = (N, 12x12, 6x6x3) = (N, 144, 108)
Note that although each patch has shape 6x6x3, we flatten it into a 108-dimensional vector.
Let's implement this in code:
class Patches(layers.Layer):
    """Gets some images and returns the patches for each image"""
    def __init__(self, patch_size):
        super(Patches, self).__init__()
        self.patch_size = patch_size

    def call(self, images):
        batch_size = tf.shape(images)[0]
        # Extract non-overlapping patch_size x patch_size patches
        patches = tf.image.extract_patches(
            images=images,
            sizes=[1, self.patch_size, self.patch_size, 1],
            strides=[1, self.patch_size, self.patch_size, 1],
            rates=[1, 1, 1, 1],
            padding="VALID",
        )
        patch_dims = patches.shape[-1]
        # Flatten the spatial grid of patches into a single sequence dimension
        patches = tf.reshape(patches, [batch_size, -1, patch_dims])
        return patches
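Before wiring Patches into the classifier we still need the two patch-related constants; both follow directly from the numbers above (72x72 input, 6x6 patches). The dummy-batch shape check at the end is just a sanity test added here for illustration:
patch_size = 6  # each patch is 6x6 pixels
num_patches = (image_size // patch_size) ** 2  # (72 // 6) ** 2 = 144

# Sanity check: 6 * 6 * 3 = 108, so we expect (2, 144, 108)
dummy_images = tf.zeros((2, image_size, image_size, 3))
print(Patches(patch_size)(dummy_images).shape)  # (2, 144, 108)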
def create_vit_classifier():
    # Creating classifier
    inputs = layers.Input(shape=input_shape)
    augmented = data_augmentation(inputs)
    patches = Patches(patch_size)(augmented)
Having obtained the flattened patches (i.e. vectors), we change their dimensionality with layers.Dense; this linear projection can map to any vector size, and we set it to 64 here (feel free to pick any dimension). The model's tensor shape becomes (N, 144, 64). A learnable positional encoding is then added via layers.Embedding. Note: there is a small departure from the original ViT here: we do not add the classification token, and the final MLP output changes accordingly; the main reason is simply faster training. If you want to add it, you can do so in the call function of the PatchEncoder class (a sketch follows after the class below).
projection_dim = 64
class PatchEncoder(layers.Layer):
    """Adding (learnable) positional encoding to input patches"""
    def __init__(self, num_patches, projection_dim):
        super(PatchEncoder, self).__init__()
        self.num_patches = num_patches
        self.projection = layers.Dense(units=projection_dim)
        self.position_embedding = layers.Embedding(
            input_dim=num_patches, output_dim=projection_dim
        )

    def call(self, patch):
        positions = tf.range(start=0, limit=self.num_patches, delta=1)
        encoded = self.projection(patch) + self.position_embedding(positions)
        return encoded
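For readers who do want the classification token mentioned above, here is a minimal sketch of how it could be added; PatchEncoderWithClassToken is a hypothetical name, and this variant is not used in the rest of the post:
class PatchEncoderWithClassToken(layers.Layer):
    """Hypothetical sketch: prepends a learnable class token to the patch sequence"""
    def __init__(self, num_patches, projection_dim):
        super(PatchEncoderWithClassToken, self).__init__()
        self.num_patches = num_patches
        self.projection = layers.Dense(units=projection_dim)
        # One extra position-embedding slot for the class token
        self.position_embedding = layers.Embedding(
            input_dim=num_patches + 1, output_dim=projection_dim
        )
        self.class_token = self.add_weight(
            name="class_token", shape=(1, 1, projection_dim),
            initializer="zeros", trainable=True
        )

    def call(self, patch):
        batch_size = tf.shape(patch)[0]
        # Tile the class token across the batch and prepend it to the sequence
        cls = tf.repeat(self.class_token, repeats=batch_size, axis=0)
        tokens = tf.concat([cls, self.projection(patch)], axis=1)
        positions = tf.range(start=0, limit=self.num_patches + 1, delta=1)
        return tokens + self.position_embedding(positions)
With this variant, the classification head would read the first token (tokens[:, 0]) instead of flattening all patch representations.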
def create_vit_classifier():
    # Creating classifier
    inputs = layers.Input(shape=input_shape)
    augmented = data_augmentation(inputs)
    patches = Patches(patch_size)(augmented)
    encoded_patches = PatchEncoder(num_patches, projection_dim)(patches)
After obtaining the encoded patches, we first layer-normalize the tokens, then apply multi-head self-attention, and finally add a residual connection (joining the input from before the LN with the output of the multi-head attention).
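The encoder loop below relies on a few Transformer hyperparameters we have not defined yet. Their concrete values are not fixed by anything above, so treat the following as assumed example settings (one reasonable choice; tune them to your budget):
num_heads = 4  # assumed: attention heads per encoder block
transformer_layers = 8  # assumed: number of stacked encoder blocks
transformer_units = [projection_dim * 2, projection_dim]  # assumed: MLP widths inside each block
mlp_head_units = [2048, 1024]  # assumed: MLP widths of the final classification head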
def create_vit_classifier():
    # Creating classifier
    inputs = layers.Input(shape=input_shape)
    augmented = data_augmentation(inputs)
    patches = Patches(patch_size)(augmented)
    encoded_patches = PatchEncoder(num_patches, projection_dim)(patches)

    for _ in range(transformer_layers):
        # Layer normalization and self-attention
        x1 = layers.LayerNormalization(epsilon=1e-6)(encoded_patches)
        attention_output = layers.MultiHeadAttention(
            num_heads, key_dim=projection_dim, dropout=0.1
        )(x1, x1)
        # Residual connection
        x2 = layers.Add()([attention_output, encoded_patches])
Continuing with the network, the current tensor passes through another LN and an MLP, again joined by a residual connection, stacking up like building blocks. Relative to the PyTorch version, we use the gelu activation function here and stack multiple Transformer layers.
def mlp(x, hidden_units, dropout_rate):
    """MLP block with GELU activations and dropout"""
    for units in hidden_units:
        x = layers.Dense(units, activation=tf.nn.gelu)(x)
        x = layers.Dropout(dropout_rate)(x)
    return x
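As a quick shape check (illustrative only; the hidden_units here are made up), the block-internal MLP keeps the sequence layout and only transforms the last axis:
# Illustrative: maps the last axis 64 -> 128 -> 64, shape stays (2, 144, 64)
demo = mlp(tf.zeros((2, 144, 64)), hidden_units=[128, 64], dropout_rate=0.1)
print(demo.shape)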
def create_vit_classifier():
    # Creating classifier
    inputs = layers.Input(shape=input_shape)
    augmented = data_augmentation(inputs)
    patches = Patches(patch_size)(augmented)
    encoded_patches = PatchEncoder(num_patches, projection_dim)(patches)

    for _ in range(transformer_layers):
        # Layer normalization and self-attention
        x1 = layers.LayerNormalization(epsilon=1e-6)(encoded_patches)
        attention_output = layers.MultiHeadAttention(
            num_heads, key_dim=projection_dim, dropout=0.1
        )(x1, x1)
        # Residual connection
        x2 = layers.Add()([attention_output, encoded_patches])
        # Normalization and MLP
        x3 = layers.LayerNormalization(epsilon=1e-6)(x2)
        x3 = mlp(x3, hidden_units=transformer_units, dropout_rate=0.1)
        # Residual connection
        encoded_patches = layers.Add()([x3, x2])
The output is first LN-normalized and then flattened; to counter overfitting we add dropout, followed by an MLP. The PyTorch version used the classification token, so after its MLP the first position held the classification result; with the change made here, the logits instead come from a final Dense layer.
def create_vit_classifier():
    # Creating classifier
    inputs = layers.Input(shape=input_shape)
    augmented = data_augmentation(inputs)
    patches = Patches(patch_size)(augmented)
    encoded_patches = PatchEncoder(num_patches, projection_dim)(patches)

    for _ in range(transformer_layers):
        # Layer normalization and self-attention
        x1 = layers.LayerNormalization(epsilon=1e-6)(encoded_patches)
        attention_output = layers.MultiHeadAttention(
            num_heads, key_dim=projection_dim, dropout=0.1
        )(x1, x1)
        # Residual connection
        x2 = layers.Add()([attention_output, encoded_patches])
        # Normalization and MLP
        x3 = layers.LayerNormalization(epsilon=1e-6)(x2)
        x3 = mlp(x3, hidden_units=transformer_units, dropout_rate=0.1)
        # Residual connection
        encoded_patches = layers.Add()([x3, x2])

    # Create a [batch_size, num_patches * projection_dim] representation
    representation = layers.LayerNormalization(epsilon=1e-6)(encoded_patches)
    representation = layers.Flatten()(representation)
    representation = layers.Dropout(0.5)(representation)
    # Add MLP
    features = mlp(representation, hidden_units=mlp_head_units, dropout_rate=0.5)
    # Classify output
    logits = layers.Dense(num_classes)(features)
    # Create the whole net
    model = keras.Model(inputs=inputs, outputs=logits)
    return model
Our model's output is now an (N, 100) tensor. OK, all done!
Now let's see how the model performs, running on CPU:
Hmm, the trends all look right. It's just a bit slow. And that's a wrap!
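If you want to inspect the curves yourself, here is a minimal sketch for plotting the history returned by run_experiment; it assumes matplotlib is installed and that you keep the returned history around (e.g. history = run_experiment(model) inside main):
import matplotlib.pyplot as plt  # assumes matplotlib is available

def plot_history(history):
    # Plot training vs. validation accuracy over the epochs
    plt.plot(history.history['accuracy'], label='train accuracy')
    plt.plot(history.history['val_accuracy'], label='val accuracy')
    plt.xlabel('epoch')
    plt.ylabel('accuracy')
    plt.legend()
    plt.show()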
Relative to the earlier PyTorch version, the TensorFlow version uses the GELU activation function and stacks multiple Transformer encoder blocks together. Many improved ViT variants have since appeared, and you can build them on top of this code. Follow the official WeChat account and reply "vit_tf" in the backend to get the full code.
Paper: https://arxiv.org/abs/2010.11929
References:
[1] https://blog.csdn.net/u012744245/article/details/112671504
[2] https://zhuanlan.zhihu.com/p/340149804