Spatial Transformer Networks (STN): Code Analysis

This is a fairly early paper on attention: early, influential, and effective in practice.

There are plenty of write-ups explaining the paper; a quick search turns up a pile of them, so I won't repeat that here.

The workflow I suggest: first read an explanation until the principle is clear, then find the code and read it against the paper. Once it clicks, you can modify the code yourself and apply it wherever you need it.

For example, for an explanation and code you can refer to:
[link to an article explaining the paper]
[link to a code repository]

In short: before classification, a transformation matrix is applied to the original image to produce a new image, and that new image is what gets classified.

So the core consists of three steps:
1. Obtain the transformation matrix, a 2×3 matrix that can express translation, scaling, rotation, cropping, and so on.
2. From the transformation matrix, compute the mapping between coordinates before and after the affine transform, i.e., the grid.
3. Apply the grid to the original image to get the new image, then run the convolutions and output the classification. (A standalone sketch of steps 2 and 3 follows this list.)
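
To see steps 2 and 3 in isolation, here is a minimal sketch that applies a fixed affine matrix to a dummy batch. The matrix values are made up for illustration; in the STN they are predicted by a network.

import torch
import torch.nn.functional as F

# A fixed 2x3 affine matrix: identity scale/rotation plus a horizontal shift.
# Illustrative values only; the STN's localization net predicts these.
theta = torch.tensor([[[1.0, 0.0, 0.25],
                       [0.0, 1.0, 0.0]]])   # shape [1, 2, 3]

imgs = torch.randn(1, 3, 32, 32)            # dummy batch, [N, C, H, W]
grid = F.affine_grid(theta, imgs.size())    # step 2: the coordinate mapping, [N, H, W, 2]
new_imgs = F.grid_sample(imgs, grid)        # step 3: sample the original image
print(new_imgs.shape)                       # torch.Size([1, 3, 32, 32])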

Example usage code:


import torch
import torch.nn as nn
import torch.nn.functional as F

import STNModule  # provides the SpatialTransformer class shown below


class STNSVHNet(nn.Module):
    def __init__(self, spatial_dim, in_channels, stn_kernel_size, kernel_size, num_classes=10, use_dropout=False):
        super(STNSVHNet, self).__init__()
        self._in_ch = in_channels
        self._ksize = kernel_size
        self._sksize = stn_kernel_size
        self.ncls = num_classes
        self.dropout = use_dropout
        self.drop_prob = 0.5
        self.stride = 1
        self.spatial_dim = spatial_dim

        # STN front end: warps the input before the classification convs
        self.stnmod = STNModule.SpatialTransformer(self._in_ch, self.spatial_dim, self._sksize)
        self.conv1 = nn.Conv2d(self._in_ch, 32, kernel_size=self._ksize, stride=self.stride, padding=1, bias=False)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=self._ksize, stride=1, padding=1, bias=False)
        self.conv3 = nn.Conv2d(64, 128, kernel_size=self._ksize, stride=1, padding=1, bias=False)

        # for 32x32 SVHN input, the two /2 poolings below leave an 8x8 map,
        # so the flattened size is 128*8*8
        self.fc1 = nn.Linear(128 * 8 * 8, 3092)
        self.fc2 = nn.Linear(3092, self.ncls)

    def forward(self, x):
        rois, affine_grid = self.stnmod(x)  # warped input and the sampling grid
        out = F.relu(self.conv1(rois))
        out = F.max_pool2d(out, 2)
        out = F.relu(self.conv2(out))
        out = F.max_pool2d(out, 2)
        out = F.relu(self.conv3(out))
        out = out.view(-1, 128 * 8 * 8)
        if self.dropout:
            out = F.dropout(self.fc1(out), p=self.drop_prob, training=self.training)
        else:
            out = self.fc1(out)
        out = self.fc2(out)
        return out
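
A minimal usage sketch, assuming 32×32 RGB SVHN input, kernel sizes of 3 (so the padding of 1 preserves spatial size), and that the SpatialTransformer class below is importable as STNModule:

net = STNSVHNet(spatial_dim=(32, 32), in_channels=3,
                stn_kernel_size=3, kernel_size=3, num_classes=10)
x = torch.randn(4, 3, 32, 32)  # dummy SVHN-sized batch
logits = net(x)
print(logits.shape)            # torch.Size([4, 10])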

The STN module being called is as follows:


class SpatialTransformer(nn.Module):
    """
    Implements a spatial transformer
    as proposed in the Jaderberg paper.
    Comprises 3 parts:
    1. A localization net
    2. A grid generator
    3. An ROI sampling module.
    The current implementation uses a very small convolutional net with
    4 convolutional layers and 2 fully connected layers. Backends
    can be swapped in favor of VGG, ResNets etc.
    Returns:
    An ROI feature map with the same spatial dimensions as the input feature map.
    """
    def __init__(self, in_channels, spatial_dims, kernel_size, use_dropout=False):
        super(SpatialTransformer, self).__init__()
        self._h, self._w = spatial_dims
        self._in_ch = in_channels
        self._ksize = kernel_size
        self.dropout = use_dropout

        # localization net: predicts the 6 affine parameters from the input
        self.conv1 = nn.Conv2d(in_channels, 32, kernel_size=self._ksize, stride=1, padding=1, bias=False)  # input: [N x in_channels x H x W]
        self.conv2 = nn.Conv2d(32, 32, kernel_size=self._ksize, stride=1, padding=1, bias=False)
        self.conv3 = nn.Conv2d(32, 32, kernel_size=self._ksize, stride=1, padding=1, bias=False)
        self.conv4 = nn.Conv2d(32, 32, kernel_size=self._ksize, stride=1, padding=1, bias=False)

        # for 32x32 input, the three /2 poolings below leave a 4x4 feature map
        self.fc1 = nn.Linear(32 * 4 * 4, 1024)
        self.fc2 = nn.Linear(1024, 6)

    def forward(self, x):
        """
        Forward pass of the STN module.
        x -> input feature map
        """
        batch_images = x
        x = F.relu(self.conv1(x.detach()))  # detach: localization-net gradients do not flow back into the input features
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2)
        x = F.relu(self.conv3(x))
        x = F.max_pool2d(x, 2)
        x = F.relu(self.conv4(x))
        x = F.max_pool2d(x, 2)
        x = x.view(-1, 32 * 4 * 4)
        if self.dropout:
            x = F.dropout(self.fc1(x), p=0.5, training=self.training)
            x = F.dropout(self.fc2(x), p=0.5, training=self.training)
        else:
            x = self.fc1(x)
            x = self.fc2(x)  # affine params, [Nx6]

        x = x.view(-1, 2, 3)  # reshape into the 2x3 affine matrix
        # the grid maps output coordinates to input sampling locations, [N, H, W, 2]
        affine_grid_points = F.affine_grid(x, torch.Size((x.size(0), self._in_ch, self._h, self._w)))
        assert affine_grid_points.size(0) == batch_images.size(0), "The batch size of the input images must match the generated grid."
        # sample the original images at the grid locations to get the warped output
        rois = F.grid_sample(batch_images, affine_grid_points)
        return rois, affine_grid_points
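
A quick standalone check of the module (shapes assume a 32×32 RGB input and kernel_size=3, so the padding of 1 preserves spatial size):

stn = SpatialTransformer(in_channels=3, spatial_dims=(32, 32), kernel_size=3)
rois, grid = stn(torch.randn(2, 3, 32, 32))
print(rois.shape)  # torch.Size([2, 3, 32, 32]) -- same spatial size as the input
print(grid.shape)  # torch.Size([2, 32, 32, 2]) -- the [N, H, W, 2] grid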

The core is just these two lines:

affine_grid_points = F.affine_grid(x, torch.Size((x.size(0), self._in_ch, self._h, self._w)))
rois = F.grid_sample(batch_images, affine_grid_points)

To understand these two calls, this reference is helpful:
Affine transforms in PyTorch (affine_grid)

  • batch_images: the original images.
  • x: the 2×3 transformation matrix, predicted from the original images by the localization net (a stack of convolutions and fully connected layers).
  • The size argument after x: the output shape of the affine transform, in [N, C, H, W] format; here it makes the output the same size as the original images.
  • F.affine_grid: produces affine_grid_points, the mapping between coordinates before and after the affine transform. It returns a 4-D tensor of shape [N, H, W, 2], where N, H, and W are the batch size, height, and width of the output feature map.
  • grid_sample: applies the mapping to the original images to get the new images, which then go through the convolutions and on to the output. (A tiny sanity check of these shapes follows below.)
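
To make the grid concrete, a tiny check (values chosen for illustration): with the identity matrix, the grid is just the normalized coordinates of the output pixels, so grid_sample reproduces the input exactly.

import torch
import torch.nn.functional as F

theta = torch.tensor([[[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0]]])  # identity transform, [1, 2, 3]
grid = F.affine_grid(theta, torch.Size((1, 1, 4, 4)), align_corners=False)
print(grid.shape)  # torch.Size([1, 4, 4, 2]) -> [N, H, W, 2]

img = torch.arange(16.0).view(1, 1, 4, 4)
out = F.grid_sample(img, grid, align_corners=False)
print(torch.allclose(out, img))  # True: the identity grid reproduces the input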

Because the whole thing is trained with ordinary supervised learning, x (the transformation parameters) is learned automatically; everything else follows from there.
