Paper: ASTER: An Attentional Scene Text Recognizer with Flexible Rectification
The paper uses an STN (Spatial Transformer Network) to rectify curved text images. The core idea of the STN is to model spatial transformation as a learnable network.
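Spatial transformation basics
Translation, rotation, and scaling via matrix transforms
2D images
Translation:
$$\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} 1 & 0 & t_x \\ 0 & 1 & t_y \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}$$
Rotation:
$$\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}$$
Scaling:
$$\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} s_x & 0 & 0 \\ 0 & s_y & 0 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}$$
So for a 2D image, 6 parameters are enough to express translation, scaling, rotation, and their combinations, i.e.:
$$\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} a & b & c \\ d & e & f \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}$$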
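The six-parameter form can be checked numerically. The following is a minimal NumPy sketch; the helper names `affine_matrix` and `apply_affine` and the composition order (scale, then rotate, then translate) are assumptions made for illustration, not something taken from the paper.

```python
import numpy as np

def affine_matrix(tx=0.0, ty=0.0, theta=0.0, sx=1.0, sy=1.0):
    """Compose scale -> rotate -> translate into one 2x3 affine matrix."""
    cos_t, sin_t = np.cos(theta), np.sin(theta)
    rotation = np.array([[cos_t, -sin_t], [sin_t, cos_t]])
    scale = np.diag([sx, sy])
    linear = rotation @ scale               # 2x2 part: four parameters
    translation = np.array([[tx], [ty]])    # last column: two parameters
    return np.hstack([linear, translation])  # 2x3, six parameters in total

def apply_affine(A, points):
    """Apply a 2x3 affine matrix to an (N, 2) array of points."""
    homogeneous = np.hstack([points, np.ones((len(points), 1))])  # (N, 3)
    return homogeneous @ A.T                                      # (N, 2)

# Stretch x by 2, rotate by 90 degrees, then shift by (1, 2).
A = affine_matrix(tx=1.0, ty=2.0, theta=np.pi / 2, sx=2.0, sy=1.0)
corners = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
moved = apply_affine(A, corners)
```

Whatever combination of the three basic transforms is chosen, the result is always a single 2x3 matrix, which is why an affine STN only has to regress six numbers.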
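The STN structure in ASTER
In ASTER, the authors apply an STN to rectify the input text image. Its main structure is shown in the figure:
- First, the input image is resized (downsampled).
- The localization network locates K control points in the image; these can be understood as points along the upper and lower boundaries of the text region.
- By prior knowledge, the rectified text region should lie flat in the central area of the image, so the coordinates of the K preset control points on the rectified image are known in advance ("The control points on the output image are placed at fixed locations along the top and bottom image borders with equal spacings.").
- The transformation matrix T is built from the positional relationship between $C'$ and $C$, and the resulting sampling grid P is used to transform the pixels of the original image into the rectified image.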
Notation:
The original input image is $I$.
The resized, smaller image is denoted $I_d$.
The rectified image is denoted $I_r$.
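Localization Network
ASTER uses the localization network to learn the K control points implicitly.
For the original image $I$, the set of control points is defined as $C' = [c'_1, \dots, c'_K] \in \mathbb{R}^{2 \times K}$, where $c'_k = [x'_k, y'_k]^T$ is the coordinate of the k-th control point.
The control point set $C$ of the rectified image $I_r$ is defined in the same way.
The localization network is a few convolutional layers with max-pooling, followed by a fully-connected output layer of size 2K, which represents the coordinates of the K control points. In the paper's words: "The network consists of a few convolutional layers, with max-pooling layers inserted between them. The output layer is a fully-connected layer whose output size is 2K."
$C'$ is regressed directly by the localization network from $I_d$.
Code: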
import numpy as np
import torch
import torch.nn as nn

# kaiming_init, constant_init and normal_init are weight-initialisation helpers
# assumed to be imported from the surrounding codebase (mmcv-style helpers).


class LocalizationNetwork(nn.Module):
    """ Localization Network of RARE, which predicts C' (K x 2) from I (I_width x I_height) """

    def __init__(self, F, I_channel_num):
        super(LocalizationNetwork, self).__init__()
        self.F = F
        self.I_channel_num = I_channel_num
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels=self.I_channel_num, out_channels=64, kernel_size=3, stride=1, padding=1,
                      bias=False), nn.BatchNorm2d(64), nn.ReLU(True),
            nn.MaxPool2d(2, 2),  # batch_size x 64 x I_height/2 x I_width/2
            nn.Conv2d(64, 128, 3, 1, 1, bias=False), nn.BatchNorm2d(128), nn.ReLU(True),
            nn.MaxPool2d(2, 2),  # batch_size x 128 x I_height/4 x I_width/4
            nn.Conv2d(128, 256, 3, 1, 1, bias=False), nn.BatchNorm2d(256), nn.ReLU(True),
            nn.MaxPool2d(2, 2),  # batch_size x 256 x I_height/8 x I_width/8
            nn.Conv2d(256, 512, 3, 1, 1, bias=False), nn.BatchNorm2d(512), nn.ReLU(True),
            nn.AdaptiveAvgPool2d(1)  # batch_size x 512
        )
        self.localization_fc1 = nn.Sequential(nn.Linear(512, 256), nn.ReLU(True))
        self.localization_fc2 = nn.Linear(256, self.F * 2)

        # Init fc2 in LocalizationNetwork: zero weights, so the initial output is the bias alone
        self.localization_fc2.weight.data.fill_(0)
        """ see RARE paper Fig. 6 (a) """
        ctrl_pts_x = np.linspace(-1.0, 1.0, int(F / 2))
        ctrl_pts_y_top = np.linspace(0.0, -1.0, num=int(F / 2))
        ctrl_pts_y_bottom = np.linspace(1.0, 0.0, num=int(F / 2))
        ctrl_pts_top = np.stack([ctrl_pts_x, ctrl_pts_y_top], axis=1)
        ctrl_pts_bottom = np.stack([ctrl_pts_x, ctrl_pts_y_bottom], axis=1)
        initial_bias = np.concatenate([ctrl_pts_top, ctrl_pts_bottom], axis=0)
        self.localization_fc2.bias.data = torch.from_numpy(initial_bias).float().view(-1)

    def init_weights(self, pretrained=None):
        if pretrained is None:
            for m in self.conv.modules():
                if isinstance(m, nn.Conv2d):
                    kaiming_init(m)
                elif isinstance(m, nn.BatchNorm2d):
                    constant_init(m, 1)
                elif isinstance(m, nn.Linear):
                    normal_init(m, std=0.01)
        elif isinstance(pretrained, str):
            ##TODO:load pretrain model from pth
            pass
        else:
            raise TypeError('pretrained must be a str or None')

    def forward(self, batch_I):
        """
        input: batch_I : Batch Input Image [batch_size x I_channel_num x I_height x I_width]
        output: batch_C_prime : Predicted coordinates of fiducial points for input batch [batch_size x F x 2]
        """
        batch_size = batch_I.size(0)
        features = self.conv(batch_I).view(batch_size, -1)
        batch_C_prime = self.localization_fc2(self.localization_fc1(features)).view(batch_size, self.F, 2)
        return batch_C_prime
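The effect of the `fc2` initialisation above can be illustrated without PyTorch. This is a minimal NumPy sketch, assuming F = 20 and a random stand-in for the `fc1` features; `initial_control_points` is a hypothetical helper that mirrors the bias construction in the class.

```python
import numpy as np

def initial_control_points(F):
    """Initial fiducial points, laid out like the fc2 bias initialisation."""
    half = F // 2
    xs = np.linspace(-1.0, 1.0, half)
    top = np.stack([xs, np.linspace(0.0, -1.0, half)], axis=1)
    bottom = np.stack([xs, np.linspace(1.0, 0.0, half)], axis=1)
    return np.concatenate([top, bottom], axis=0)  # F x 2

F = 20                                         # assumed number of control points
weight = np.zeros((F * 2, 256))                # fc2 weight filled with zeros
bias = initial_control_points(F).reshape(-1)   # fc2 bias, length 2F

rng = np.random.default_rng(0)
features = rng.normal(size=(4, 256))           # stand-in for fc1 outputs, batch of 4
pred = (features @ weight.T + bias).reshape(4, F, 2)
```

Because the weights are zero, every image in the batch starts with the same predicted control points, so training begins from a fixed, well-behaved layout rather than from random points.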
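Grid generator
As shown in the figure, we want to derive a transformation matrix T from the positional relationship between the learned control points $C'$ and the preset control points $C$, so that T can be applied to the whole input image to obtain the rectified image $I_r$.
We saw above that 6 parameters are enough for an affine transformation of a 2D image. Building on this, the 2D TPS transformation matrix is defined as a $2 \times (K+3)$ matrix:
$$T = \begin{bmatrix} a_0 & a_1 & a_2 & \mathbf{u} \\ b_0 & b_1 & b_2 & \mathbf{v} \end{bmatrix}$$
where $\mathbf{u}, \mathbf{v} \in \mathbb{R}^{1 \times K}$ (one weight per control point, so the information from all K control points is used).
Given a 2D point $p = [x_p, y_p]^T$, TPS obtains its transformed point $p'$ through the linear projection T:
$$p' = T \begin{bmatrix} 1 \\ p \\ \phi(\|p - c_1\|) \\ \vdots \\ \phi(\|p - c_K\|) \end{bmatrix}$$
where $\phi(r) = r^2 \log r$ is the kernel function, applied to the Euclidean distance between the pixel p and the control point $c_k$.
T is obtained from the linear mapping between $C$ and $C'$, so
$$c'_i = T \begin{bmatrix} 1 \\ c_i \\ \phi(\|c_i - c_1\|) \\ \vdots \\ \phi(\|c_i - c_K\|) \end{bmatrix}, \quad i = 1, \dots, K$$
where the boundary conditions are set as:
$$\mathbf{u}\,\mathbf{1}^{K \times 1} = 0, \quad \mathbf{v}\,\mathbf{1}^{K \times 1} = 0, \quad \mathbf{u}\,C^T = \mathbf{0}^{1 \times 2}, \quad \mathbf{v}\,C^T = \mathbf{0}^{1 \times 2}$$
Combining the equations above, we get:
$$T \Delta_C = \begin{bmatrix} C' & \mathbf{0}^{2 \times 3} \end{bmatrix}, \quad \Delta_C = \begin{bmatrix} \mathbf{1}^{1 \times K} & 0 & \mathbf{0}^{1 \times 2} \\ C & \mathbf{0}^{2 \times 1} & \mathbf{0}^{2 \times 2} \\ \hat{C} & \mathbf{1}^{K \times 1} & C^T \end{bmatrix}$$
where $\hat{C}$ is a $K \times K$ matrix with entries $\hat{c}_{i,j} = \phi(\|c_i - c_j\|)$.
Solving this gives the transformation matrix T:
$$T = \begin{bmatrix} C' & \mathbf{0}^{2 \times 3} \end{bmatrix} \Delta_C^{-1}$$
(The code stores the transpose of T, of shape (F+3) x 2, with F playing the role of K.)
So the structure of the whole grid generator is as shown in the figure below:
The T matrix is built from the $C'$ regressed by the localization network and the preset $C$, and is then applied to every pixel position p of the rectified image to find its source position p' in the input image, which yields the rectified image.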
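The closed-form solve can be sanity-checked on a toy example. The following is a minimal NumPy sketch, assuming K = 6 control points and a made-up bend of the target points; `tps_kernel`, `lift` and `solve_tps` are hypothetical helper names, and T is stored as (K+3) x 2, the same layout as `batch_T` in the code.

```python
import numpy as np

def tps_kernel(r, eps=1e-6):
    """Radial basis kernel phi(r) = r^2 * log(r); eps guards against log(0)."""
    return np.square(r) * np.log(r + eps)

def lift(points, C):
    """Lift (N, 2) points to (N, K+3) rows [1, x, y, phi(|p - c_1|), ..., phi(|p - c_K|)]."""
    dist = np.linalg.norm(points[:, None, :] - C[None, :, :], axis=2)  # N x K
    return np.hstack([np.ones((len(points), 1)), points, tps_kernel(dist)])

def solve_tps(C, C_prime):
    """Solve for T, stored as (K+3) x 2, from the K correspondences c_k -> c'_k."""
    K = C.shape[0]
    delta = np.zeros((K + 3, K + 3))
    delta[:K, :] = lift(C, C)        # interpolation rows: lift(c_i) @ T = c'_i
    delta[K:K + 2, 3:] = C.T         # boundary rows: kernel weights weighted by c_k sum to 0
    delta[K + 2, 3:] = 1.0           # boundary row: kernel weights sum to 0
    rhs = np.vstack([C_prime, np.zeros((3, 2))])
    return np.linalg.solve(delta, rhs)

K = 6
xs = np.linspace(-1.0, 1.0, K // 2)
C = np.vstack([np.stack([xs, -np.ones(K // 2)], axis=1),
               np.stack([xs, np.ones(K // 2)], axis=1)])   # fixed output control points
C_prime = C + np.array([0.0, 0.3]) * (C[:, :1] ** 2)       # made-up bend: y shifted by 0.3 * x^2
T = solve_tps(C, C_prime)
mapped = lift(C, C) @ T                                    # should reproduce C_prime
```

Applying `lift(P, C) @ T` to any other set of points P interpolates smoothly between the control points, which is what the grid generator does for every pixel position of the rectified image.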
Code:
class GridGenerator(nn.Module):
    """ Grid Generator of RARE, which produces P_prime by multiplying T with P """

    def __init__(self, F, I_r_size):
        """ Generate P_hat and inv_delta_C for later """
        super(GridGenerator, self).__init__()
        self.eps = 1e-6
        self.I_r_height, self.I_r_width = I_r_size
        self.F = F
        self.C = self._build_C(self.F)  # F x 2
        self.P = self._build_P(self.I_r_width, self.I_r_height)
        ## for multi-gpu, you need register buffer
        self.register_buffer("inv_delta_C", torch.tensor(self._build_inv_delta_C(self.F, self.C)).float())  # F+3 x F+3
        self.register_buffer("P_hat", torch.tensor(self._build_P_hat(self.F, self.C, self.P)).float())  # n x F+3
        ## for fine-tuning with different image width, you may use below instead of self.register_buffer
        # self.inv_delta_C = torch.tensor(self._build_inv_delta_C(self.F, self.C)).float().cuda()  # F+3 x F+3
        # self.P_hat = torch.tensor(self._build_P_hat(self.F, self.C, self.P)).float().cuda()  # n x F+3

    def _build_C(self, F):
        """ Return coordinates of fiducial points in I_r; C """
        ctrl_pts_x = np.linspace(-1.0, 1.0, int(F / 2))
        ctrl_pts_y_top = -1 * np.ones(int(F / 2))
        ctrl_pts_y_bottom = np.ones(int(F / 2))
        ctrl_pts_top = np.stack([ctrl_pts_x, ctrl_pts_y_top], axis=1)
        ctrl_pts_bottom = np.stack([ctrl_pts_x, ctrl_pts_y_bottom], axis=1)
        C = np.concatenate([ctrl_pts_top, ctrl_pts_bottom], axis=0)
        return C  # F x 2

    def _build_inv_delta_C(self, F, C):
        """ Return inv_delta_C which is needed to calculate T """
        hat_C = np.zeros((F, F), dtype=float)  # F x F
        for i in range(0, F):
            for j in range(i, F):
                r = np.linalg.norm(C[i] - C[j])
                hat_C[i, j] = r
                hat_C[j, i] = r
        np.fill_diagonal(hat_C, 1)  # log(1) = 0, so the diagonal of r^2 * log(r) becomes 0
        hat_C = (hat_C ** 2) * np.log(hat_C)
        # print(C.shape, hat_C.shape)
        delta_C = np.concatenate(  # F+3 x F+3
            [
                np.concatenate([np.ones((F, 1)), C, hat_C], axis=1),  # F x F+3
                np.concatenate([np.zeros((2, 3)), np.transpose(C)], axis=1),  # 2 x F+3
                np.concatenate([np.zeros((1, 3)), np.ones((1, F))], axis=1)  # 1 x F+3
            ],
            axis=0
        )
        inv_delta_C = np.linalg.inv(delta_C)
        return inv_delta_C  # F+3 x F+3

    def _build_P(self, I_r_width, I_r_height):
        I_r_grid_x = (np.arange(-I_r_width, I_r_width, 2) + 1.0) / I_r_width  # self.I_r_width
        I_r_grid_y = (np.arange(-I_r_height, I_r_height, 2) + 1.0) / I_r_height  # self.I_r_height
        P = np.stack(  # self.I_r_height x self.I_r_width x 2
            np.meshgrid(I_r_grid_x, I_r_grid_y),
            axis=2
        )
        return P.reshape([-1, 2])  # n (= self.I_r_width x self.I_r_height) x 2

    def _build_P_hat(self, F, C, P):
        n = P.shape[0]  # n (= self.I_r_width x self.I_r_height)
        P_tile = np.tile(np.expand_dims(P, axis=1), (1, F, 1))  # n x 2 -> n x 1 x 2 -> n x F x 2
        C_tile = np.expand_dims(C, axis=0)  # 1 x F x 2
        P_diff = P_tile - C_tile  # n x F x 2
        rbf_norm = np.linalg.norm(P_diff, ord=2, axis=2, keepdims=False)  # n x F
        rbf = np.multiply(np.square(rbf_norm), np.log(rbf_norm + self.eps))  # n x F
        P_hat = np.concatenate([np.ones((n, 1)), P, rbf], axis=1)
        return P_hat  # n x F+3

    def build_P_prime(self, batch_C_prime):
        """ Generate Grid from batch_C_prime [batch_size x F x 2] """
        device = batch_C_prime.device
        batch_size = batch_C_prime.size(0)
        batch_inv_delta_C = self.inv_delta_C.repeat(batch_size, 1, 1)
        batch_P_hat = self.P_hat.repeat(batch_size, 1, 1)
        batch_C_prime_with_zeros = torch.cat((batch_C_prime, torch.zeros(
            batch_size, 3, 2).float().to(device)), dim=1)  # batch_size x F+3 x 2
        batch_T = torch.bmm(batch_inv_delta_C, batch_C_prime_with_zeros)  # batch_size x F+3 x 2
        batch_P_prime = torch.bmm(batch_P_hat, batch_T)  # batch_size x n x 2
        return batch_P_prime  # batch_size x n x 2
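Finally, the sampler samples the input image at the transformed points p', clipping them so that they do not fall outside the image borders.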
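The sampling step with clipping can be sketched as follows. This is a minimal NumPy illustration for a single-channel image, not the sampler used in the paper or in any particular repository; the function name `bilinear_sample` and the convention that -1 and 1 map to the centres of the corner pixels are assumptions.

```python
import numpy as np

def bilinear_sample(image, grid):
    """Sample an (H, W) image at normalised (x, y) points in [-1, 1]; grid is (n, 2)."""
    H, W = image.shape
    grid = np.clip(grid, -1.0, 1.0)            # keep sampling points inside the image
    x = (grid[:, 0] + 1.0) * (W - 1) / 2.0     # map [-1, 1] to [0, W-1]
    y = (grid[:, 1] + 1.0) * (H - 1) / 2.0     # map [-1, 1] to [0, H-1]
    x0 = np.clip(np.floor(x).astype(int), 0, W - 2)  # left neighbour column
    y0 = np.clip(np.floor(y).astype(int), 0, H - 2)  # upper neighbour row
    wx, wy = x - x0, y - y0                    # interpolation weights in [0, 1]
    top = (1 - wx) * image[y0, x0] + wx * image[y0, x0 + 1]
    bottom = (1 - wx) * image[y0 + 1, x0] + wx * image[y0 + 1, x0 + 1]
    return (1 - wy) * top + wy * bottom

image = np.arange(12, dtype=float).reshape(3, 4)
grid = np.array([[-1.0, -1.0], [1.0, 1.0], [0.0, 0.0], [5.0, -7.0]])
sampled = bilinear_sample(image, grid)
```

The last grid point lies far outside the image; clipping moves it to the top-right corner, so the sampler still returns a valid pixel value. Bilinear interpolation keeps the output differentiable with respect to the sampling positions, which is what lets gradients reach the localization network.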
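This completes ASTER's rectification of curved text images.
The network needs no extra annotation: the gradient of the text recognition loss, flowing back through the CNN, concentrates on the text region itself, and this signal effectively serves as the loss for image rectification.
In addition, the localization network does not apply tanh to its output. Its last layer is a plain fully-connected layer, i.e. a linear activation, which preserves more of the useful gradient during training.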
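A few small open questions:
- Why is $\phi(r) = r^2 \log r$ chosen as the kernel function in the transformation, and what advantage does it have over other kernels?
- The choice of transformation: for a point p, why is its Euclidean distance to all control points computed (rather than only the distance to the corresponding point)?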