LoFTR:Detector-Free Local Feature Matching with Transformers

LoFTR: Detector-Free Local Feature Matching with Transformers

LoFTR:基于Transformer实现局部特征匹配

发表时间:[Submitted on 1 Apr 2021]

发表期刊/会议:Computer Vision and Pattern Recognition

论文地址:https://arxiv.org/abs/2104.00680

代码地址:https://zju3dv.github.io/loftr/


0 摘要

本文提出一种新的图像局部特征匹配方法(关键点匹配);

与传统方法(特征检测-描述符-匹配)不同,本文首先在粗粒度上进行像素级密集匹配然后再细粒度进行优化。

本文在Transformer中使用自注意层(self attention layer)和交叉注意层(cross attention layer)来获取两个图像的特征描述符;

在室内数据集和室外数据集上实验,表明,LoFTR很大程度上优于现在的方法。


1 简介

现有的匹配方法大多数包含三个阶段:特征检测-特征描述-特征匹配;

由于纹理稀疏、纹理重复、视点变化、光照变化等原因,特征检测器可能无法提取足够的特征点(第一阶段就效果不好,后面就不用说了);

最近一些工作通过建立像素级密集匹配来解决这个问题,由于CNN感受野小,效果并不好。

LoFTR:Detector-Free Local Feature Matching with Transformers_第1张图片
图1:LoFTR方法与SuperGlue方法比较。

LoFTR能够在没有纹理的墙壁和地板上找到更多的关键点。基于特征检测的方法SuperGlue找不到可匹配的关键点。



基于以上结果,本文提出一种新的不用检测器的局部特征匹配方法LoFTR,此方法采用self attention和cross attention来处理从CNN中提取的密集局部特征。

  • 利用CNN提取低分辨率特征;
  • 特征经过LoFTR模块得到重构特征:
    • Transformer encoder;
    • Positional Encoding;
    • Self attention;
    • Cross attention;
  • 用得到的特征进行粗粒度匹配;
  • 对得到的粗粒度用阈值筛选,使用基于相关性的方法细化到像素级;
  • 在细粒度(像素级)再次进行匹配;


2 相关工作

基于检测器的局部特征匹配方面的工作:典型如SIFT、ORB;

无检测器局部特征匹配方面的工作:如SIFT FLOW、NCNet;

Transformer方面的相关工作;


3 方法

3.1 局部特征提取

CNN具有局部性和平移等变性归纳偏置,适合提取局部特征。使用FPN(也是CNN的一种)来从图像对 I A I^A IA I B I^B IB中提取多级特征:

  • F ^ A \widehat{F}^A F A F ^ B \widehat{F}^B F B分别代表用FPN提取的细粒度特征(原图的1/2,降低计算成本);
  • F ~ A \widetilde{F}^A F A F ~ B \widetilde{F}^B F B分别代表用FPN提取的粗粒度特征(原图的1/8,降低计算成本);
LoFTR:Detector-Free Local Feature Matching with Transformers_第2张图片
图2.1:LoFTR第一步,CNN提取局部特征。

补充:CNN中的归纳偏置

归纳偏置其实就是一种先验知识,一种提前做好的假设。

在CNN中的归纳偏置一般包括两类:①locality(局部性)和②translation equivariance(平移等变性)

① locality:假设相同的区域会有相同的特征,靠得越近的东西相关性能也就越强。局部性可以控制模型的复杂度。

②translation equivariance:由于卷积核是一样的所以不管图片中的物体移动到哪里,只要是同样的输入进来遇到同样的卷积核,那么输出就是一样的。利用平移等变形可以很好的提高模型的泛化能力。
参考论文:https://arxiv.org/abs/2010.08515



3.2 LoFTR模块

提取到特征 F ~ A \widetilde{F}^A F A F ~ B \widetilde{F}^B F B后,通过LoFTR模块来提取位置特征和上下文特征。经过LoFTR模块后的特征用 F ~ t r A \widetilde{F}^A_{tr} F trA F ~ t r B \widetilde{F}^B_{tr} F trB来表示。

步骤(如图2.2所示):

  • 1.提取到的特征flatten为向量,加上位置编码;
  • 2.self-attention;
  • 3.cross-attention;
  • 4.重复step2和step3 N c N_c Nc次;
  • 5.得到 F ~ t r A \widetilde{F}^A_{tr} F trA F ~ t r B \widetilde{F}^B_{tr} F trB

其实就是完全套到Transformer里;

self-attention:自己人问自己人(图A问图A),不要匹配重复了(见图3(b));

cross-attention:自己人问其他人(图A问图B),能不能匹配上(见图3(b));

LoFTR:Detector-Free Local Feature Matching with Transformers_第3张图片
图3(b):可视化结果。

Feature Visualization:将得到的特征经过PCA降维后,在RGB空间可视化。可以看出,两张图像同种颜色为图一区域,证明特征提取的好。



流程展示:

LoFTR:Detector-Free Local Feature Matching with Transformers_第4张图片
图2.2:LoFTR模块。


3.3 粗粒度匹配

  • 计算特征向量直接的匹配概率 P c ( i , j ) P_c(i,j) Pc(i,j)
  • 根据阈值 θ c θ_c θc和相互最近邻算法(MNN)来过滤可能的异常值;

假设:

经过Transformer得到区域60 × 80个(非像素级的);

F ~ t r A \widetilde{F}^A_{tr} F trA特征向量长度:60 × 80 = 4800;

  1. 特征之间的得分矩阵S计算如下(向量内积),S大小为4800;
  1. softmax表示对得分矩阵S进行归一化,得分矩阵进行内积, P c ( i , j ) 大小为 4800 × 4800 P_c(i,j)大小为4800 × 4800 Pc(i,j)大小为4800×4800
  1. 置信度大于阈值 θ c θ_c θc并且是相互最近邻(MNN);
  1. 如下图所示:
LoFTR:Detector-Free Local Feature Matching with Transformers_第5张图片
图2.3:计算粗粒度匹配。
  • 输入图像大小为 H × W,降采样8倍,图像大小变为 ( 1 8 × H ) × ( 1 8 × W ) (\frac{1}{8}×H)×(\frac{1}{8}×W) (81×H)×(81×W)
  • flatten成特征向量后长度为 ( 1 8 ) 2 × H × W (\frac{1}{8})^2×H×W (81)2×H×W,故 P c P_c Pc矩阵边长为 ( 1 8 ) 2 × H × W (\frac{1}{8})^2×H×W (81)2×H×W P c P_c Pc矩阵大小为 ( ( 1 8 ) 2 × H × W ) 2 ((\frac{1}{8})^2×H×W)^2 ((81)2×H×W)2

代码:

    def forward(self, feat_c0, feat_c1, data, mask_c0=None, mask_c1=None):

        N, L, S, C = feat_c0.size(0), feat_c0.size(1), feat_c1.size(1), feat_c0.size(2)

        # normalize
        # 两个特征
        feat_c0, feat_c1 = map(lambda feat: feat / feat.shape[-1]**.5,
                               [feat_c0, feat_c1])
        print("feat_c0 shape",feat_c0.shape)
        print("feat_c1 shape", feat_c1.shape)


        if self.match_type == 'dual_softmax':
            # 计算矩阵S
            # torch.einsum "nlc,nsc->nls" 矩阵乘法
            sim_matrix = torch.einsum("nlc,nsc->nls", feat_c0,
                                      feat_c1) / self.temperature
            
            if mask_c0 is not None:
                sim_matrix.masked_fill_(
                    ~(mask_c0[..., None] * mask_c1[:, None]).bool(),
                    -INF)
            # 计算矩阵P_c
            conf_matrix = F.softmax(sim_matrix, 1) * F.softmax(sim_matrix, 2)


3.4 细粒度匹配

步骤:

  • 根据粗粒度匹配结果在对应1 / 2图像的对应区域上裁取 w × w w × w w×w大小的块(比如5 × 5的块,这里是像素级了);
  • 将这两个 w × w w × w w×w的块放入上述LoFTR模块中,得到新特征向量 F ^ t r A ( i ^ ) \widehat{F}^A_{tr}(\widehat{i}) F trA(i ) F ^ t r B ( j ^ ) \widehat{F}^B_{tr}(\widehat{j}) F trB(j ),分别是以点 i ^ \widehat{i} i j ^ \widehat{j} j 为中心;
  • 计算 F ^ t r A ( i ^ ) \widehat{F}^A_{tr}(\widehat{i}) F trA(i )的中心点向量与 F ^ t r B ( j ^ ) \widehat{F}^B_{tr}(\widehat{j}) F trB(j )每个向量的相关性(翻译一下:计算图A中心点和图B25个点的相关性)并归一化,得到5 × 5的矩阵;
  • 计算概率分布的期望,得到B图上与 i ^ \widehat{i} i 点最匹配的点 j ^ ′ \widehat{j}' j ,表示为 M f = ( i ^ , j ^ ′ ) M_f = {(\widehat{i},\widehat{j}')} Mf=(i ,j )
LoFTR:Detector-Free Local Feature Matching with Transformers_第6张图片
图2.4:细粒度匹配。

代码

class FineMatching(nn.Module):
    """FineMatching with s2d paradigm"""

    def __init__(self):
        super().__init__()

    def forward(self, feat_f0, feat_f1, data):

        M, WW, C = feat_f0.shape
        # 得到w
        W = int(math.sqrt(WW))
        scale = data['hw0_i'][0] / data['hw0_f'][0]
        self.M, self.W, self.WW, self.C, self.scale = M, W, WW, C, scale

        # M:粗粒度匹配到的数量
        # corner case: if no coarse matches found
        if M == 0:
            assert self.training == False, "M is always >0, when training, see coarse_matching.py"
            # logger.warning('No matches found in coarse-level.')
            data.update({
                'expec_f': torch.empty(0, 3, device=feat_f0.device),
                'mkpts0_f': data['mkpts0_c'],
                'mkpts1_f': data['mkpts1_c'],
            })
            return

        # 中心点
        feat_f0_picked = feat_f0_picked = feat_f0[:, WW//2, :]
        sim_matrix = torch.einsum('mc,mrc->mr', feat_f0_picked, feat_f1)
        softmax_temp = 1. / C**.5
        # heatmap: w * w
        heatmap = torch.softmax(softmax_temp * sim_matrix, dim=1).view(-1, W, W)

        # compute coordinates from heatmap
        # 求期望
        coords_normalized = dsnt.spatial_expectation2d(heatmap[None], True)[0]  # [M, 2]
        # 热图
        grid_normalized = create_meshgrid(W, W, True, heatmap.device).reshape(1, -1, 2)  # [1, WW, 2]

        # compute std over 
        var = torch.sum(grid_normalized**2 * heatmap.view(-1, WW, 1), dim=1) - coords_normalized**2  # [M, 2]
        std = torch.sum(torch.sqrt(torch.clamp(var, min=1e-10)), -1)  # [M]  clamp needed for numerical stability
        
        # for fine-level supervision
        data.update({'expec_f': torch.cat([coords_normalized, std.unsqueeze(1)], -1)})

        # compute absolute kpt coords
        self.get_fine_match(coords_normalized, data)

    @torch.no_grad()
    def get_fine_match(self, coords_normed, data):
        W, WW, C, scale = self.W, self.WW, self.C, self.scale

        # mkpts0_f and mkpts1_f
        mkpts0_f = data['mkpts0_c']
        scale1 = scale * data['scale1'][data['b_ids']] if 'scale0' in data else scale
        mkpts1_f = data['mkpts1_c'] + (coords_normed * (W // 2) * scale1)[:len(data['mconf'])]

        data.update({
            "mkpts0_f": mkpts0_f,
            "mkpts1_f": mkpts1_f
        })


4 代码

4.1 LoFTR pipeline

class LoFTR(nn.Module):
    def __init__(self, config):
        super().__init__()
        # Misc
        self.config = config

        # Modules
        # FPN: resnet as backbone
        self.backbone = build_backbone(config)
        # 位置编码
        self.pos_encoding = PositionEncodingSine(
            config['coarse']['d_model'],
            temp_bug_fix=config['coarse']['temp_bug_fix'])
        # LoFTR module
        self.loftr_coarse = LocalFeatureTransformer(config['coarse'])
        # 粗粒度匹配
        self.coarse_matching = CoarseMatching(config['match_coarse'])
        # 细粒度匹配前的数据预处理
        self.fine_preprocess = FinePreprocess(config)
        # LoFTR module(与粗粒度的配置不同)
        self.loftr_fine = LocalFeatureTransformer(config["fine"])
        # 细粒度匹配
        self.fine_matching = FineMatching()

    def forward(self, data):
        """ 
        Update:
            data (dict): {
                'image0': (torch.Tensor): (N, 1, H, W)
                'image1': (torch.Tensor): (N, 1, H, W)
                'mask0'(optional) : (torch.Tensor): (N, H, W) '0' indicates a padded position
                'mask1'(optional) : (torch.Tensor): (N, H, W)
            }
        """
        # 1. Local Feature CNN
        data.update({
            'bs': data['image0'].size(0),
            'hw0_i': data['image0'].shape[2:], 'hw1_i': data['image1'].shape[2:]
        })

        # input.shape: (640,480)
        if data['hw0_i'] == data['hw1_i']:  # faster & better BN convergence
            # backbone cat 拼接两张图像 提取特征
            feats_c, feats_f = self.backbone(torch.cat([data['image0'], data['image1']], dim=0))
            # feats_c.shape: torch.Size([2,256,60,80]) 1 / 8
            # feats_f.shape: torch.Size([2,128,240,320]) 1 / 2

            # 两张特征图分开
            (feat_c0, feat_c1), (feat_f0, feat_f1) = feats_c.split(data['bs']), feats_f.split(data['bs'])
        else:  # handle different input shapes
            (feat_c0, feat_f0), (feat_c1, feat_f1) = self.backbone(data['image0']), self.backbone(data['image1'])

        data.update({
            'hw0_c': feat_c0.shape[2:], 'hw1_c': feat_c1.shape[2:],
            'hw0_f': feat_f0.shape[2:], 'hw1_f': feat_f1.shape[2:]
        })

        # 2. coarse-level loftr module
        # add featmap with positional encoding, then flatten it to sequence [N, HW, C]
        # 位置编码
        feat_c0 = rearrange(self.pos_encoding(feat_c0), 'n c h w -> n (h w) c')
        feat_c1 = rearrange(self.pos_encoding(feat_c1), 'n c h w -> n (h w) c')

        mask_c0 = mask_c1 = None  # mask is useful in training
        if 'mask0' in data:
            mask_c0, mask_c1 = data['mask0'].flatten(-2), data['mask1'].flatten(-2)
        # 经过LoFTR module编码
        feat_c0, feat_c1 = self.loftr_coarse(feat_c0, feat_c1, mask_c0, mask_c1)

        # 3. match coarse-level 粗粒度匹配
        self.coarse_matching(feat_c0, feat_c1, data, mask_c0=mask_c0, mask_c1=mask_c1)

        # 4. fine-level refinement 细粒度匹配数据处理
        feat_f0_unfold, feat_f1_unfold = self.fine_preprocess(feat_f0, feat_f1, feat_c0, feat_c1, data)
        if feat_f0_unfold.size(0) != 0:  # at least one coarse level predicted
            feat_f0_unfold, feat_f1_unfold = self.loftr_fine(feat_f0_unfold, feat_f1_unfold)

        # 5. match fine-level 细粒度匹配
        self.fine_matching(feat_f0_unfold, feat_f1_unfold, data)

    def load_state_dict(self, state_dict, *args, **kwargs):
        for k in list(state_dict.keys()):
            if k.startswith('matcher.'):
                state_dict[k.replace('matcher.', '', 1)] = state_dict.pop(k)
        return super().load_state_dict(state_dict, *args, **kwargs)

4.2 局部特征提取FPN(对应3.1)

就是一些卷积什么的…

def build_backbone(config):
    if config['backbone_type'] == 'ResNetFPN':
        if config['resolution'] == (8, 2):
            return ResNetFPN_8_2(config['resnetfpn'])
        

class ResNetFPN_8_2(nn.Module):
    """
    ResNet+FPN, output resolution are 1/8 and 1/2.
    Each block has 2 layers.
    """

    def __init__(self, config):
        super().__init__()
        # Config
        block = BasicBlock
        initial_dim = config['initial_dim']
        block_dims = config['block_dims']

        # Class Variable
        self.in_planes = initial_dim

        # Networks
        self.conv1 = nn.Conv2d(1, initial_dim, kernel_size=7, stride=2, padding=3, bias=False)
        self.bn1 = nn.BatchNorm2d(initial_dim)
        self.relu = nn.ReLU(inplace=True)

        self.layer1 = self._make_layer(block, block_dims[0], stride=1)  # 1/2
        self.layer2 = self._make_layer(block, block_dims[1], stride=2)  # 1/4
        self.layer3 = self._make_layer(block, block_dims[2], stride=2)  # 1/8

        # 3. FPN upsample
        self.layer3_outconv = conv1x1(block_dims[2], block_dims[2])
        self.layer2_outconv = conv1x1(block_dims[1], block_dims[2])
        self.layer2_outconv2 = nn.Sequential(
            conv3x3(block_dims[2], block_dims[2]),
            nn.BatchNorm2d(block_dims[2]),
            nn.LeakyReLU(),
            conv3x3(block_dims[2], block_dims[1]),
        )
        self.layer1_outconv = conv1x1(block_dims[0], block_dims[1])
        self.layer1_outconv2 = nn.Sequential(
            conv3x3(block_dims[1], block_dims[1]),
            nn.BatchNorm2d(block_dims[1]),
            nn.LeakyReLU(),
            conv3x3(block_dims[1], block_dims[0]),
        )

        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
            elif isinstance(m, (nn.BatchNorm2d, nn.GroupNorm)):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)

    def _make_layer(self, block, dim, stride=1):
        layer1 = block(self.in_planes, dim, stride=stride)
        layer2 = block(dim, dim, stride=1)
        layers = (layer1, layer2)

        self.in_planes = dim
        return nn.Sequential(*layers)

    def forward(self, x):
        # ResNet Backbone
        x0 = self.relu(self.bn1(self.conv1(x)))
        x1 = self.layer1(x0)  # 1/2
        x2 = self.layer2(x1)  # 1/4
        x3 = self.layer3(x2)  # 1/8

        # FPN
        x3_out = self.layer3_outconv(x3)

        # 上采样
        x3_out_2x = F.interpolate(x3_out, scale_factor=2., mode='bilinear', align_corners=True)
        x2_out = self.layer2_outconv(x2)
        x2_out = self.layer2_outconv2(x2_out+x3_out_2x)

        x2_out_2x = F.interpolate(x2_out, scale_factor=2., mode='bilinear', align_corners=True)
        x1_out = self.layer1_outconv(x1)
        x1_out = self.layer1_outconv2(x1_out+x2_out_2x)

        return [x3_out, x1_out]

4.3 位置编码

# 正弦位置编码
class PositionEncodingSine(nn.Module):


    def __init__(self, d_model, max_shape=(256, 256), temp_bug_fix=True):

        super().__init__()

        pe = torch.zeros((d_model, *max_shape))
        # 256 * 256
        # 1 1 1...
        # 2 2 2...
        # ...
        # 256 256 256...
        y_position = torch.ones(max_shape).cumsum(0).float().unsqueeze(0)
        # # 256 * 256
        # 1 2 3 ...256
        # 1 2 3 ...256
        # 1 2 3 ...256
        x_position = torch.ones(max_shape).cumsum(1).float().unsqueeze(0)

        # div_term: torch.Size([64])
        if temp_bug_fix:
            div_term = torch.exp(torch.arange(0, d_model//2, 2).float() * (-math.log(10000.0) / (d_model//2)))
        else:  # a buggy implementation (for backward compatability only)
            div_term = torch.exp(torch.arange(0, d_model//2, 2).float() * (-math.log(10000.0) / d_model//2))

        div_term = div_term[:, None, None]  # [C//4, 1, 1]
        # pe torch.Size([256, 256, 256])
        pe[0::4, :, :] = torch.sin(x_position * div_term)
        pe[1::4, :, :] = torch.cos(x_position * div_term)
        pe[2::4, :, :] = torch.sin(y_position * div_term)
        pe[3::4, :, :] = torch.cos(y_position * div_term)
        

        self.register_buffer('pe', pe.unsqueeze(0), persistent=False)  # [1, C, H, W]

    def forward(self, x):
        """
        Args:
            x: [N, C, H, W]
        """
        return x + self.pe[:, :, :x.size(2), :x.size(3)]

你可能感兴趣的:(计算机视觉,深度学习,人工智能)