Paper Reading: Why Can't I Dance in the Mall? Learning to Mitigate Scene Bias in Action Recognition

Contents

Background

How To Do

Overall Network Framework

Result

Questions (Things To Do)

Paper: https://arxiv.org/abs/1912.05534

Code: https://github.com/vt-vl-lab/SDN

Project website: http://chengao.vision/SDN/

Background

[Image 1]

The authors observe that a CNN for action recognition can often classify a clip from the scene alone, without attending to the human motion at all. This invites scene bias and hurts robustness. They therefore propose a GAN-like debiasing method for action recognition that pushes the model to focus on the action itself rather than on the scene it happens in.

How To Do

The authors rework the loss function: besides the usual cross-entropy loss for action classification, they design two additional losses:

  1. a scene-adversarial loss
  2. a human mask confusion loss

We use a scene-adversarial loss to obtain scene-invariant feature representation. We use a human mask confusion loss to encourage a network to be unable to predict correct actions when there is no visual evidence.

In other words, the scene-adversarial loss is meant to learn a "scene-invariant" feature (one that does not depend on the scene), while the human mask confusion loss keeps the network from telling the action class of a video whose humans are masked out, forcing it to concentrate on the action itself.

First, the authors define the scene representation bias of a dataset D as

B(D) = \log \frac{M(D, \phi_{scene})}{M_{rand}}

where \phi_{scene} is a scene representation, M(D, \phi_{scene}) is the action classification accuracy achieved with the scene representation \phi_{scene} on the dataset D, and M_{rand} is the random-chance action classification accuracy on the dataset D.

The authors also give a worked example. Take a ResNet-50 pre-trained on Places365 (a scene classification dataset), put a linear classifier on top of its 2048-dimensional features, and train it to classify the actions of the videos in UCF101 (an action classification dataset); the final accuracy reaches 59.7%, while random-chance classification on this dataset is about 1% (1/101). The scene representation bias of UCF101 is therefore log(59.7/1.0) = 4.09. A completely unbiased dataset would have a scene representation bias of log(1.0/1.0) = 0.
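
As a quick sanity check, the computation is just a log ratio (a minimal sketch; 59.7% and 1/101 are the numbers quoted above):

    import math

    acc_scene = 59.7        # linear classifier on frozen Places365 features (%)
    acc_rand = 100.0 / 101  # random chance over the 101 UCF101 classes (~0.99%)

    # Scene representation bias: how much better scene-only features do
    # than random guessing, on a log scale.
    bias = math.log(acc_scene / acc_rand)
    print(round(bias, 2))   # ~4.1 (the paper rounds random chance to 1%, giving 4.09)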

Overall Network Framework

[Image 2: overall network framework]

The training goal of this paper is to obtain parameters \theta_{f} of a feature extractor G that attends, as much as possible, only to the action itself.

First, the standard cross-entropy loss for action classification:
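
In standard notation this is (a sketch in my notation, with y_{n} the one-hot action label and \hat{p}_{n} the predicted action probability):

L_{Action} = -\sum_{n=1}^{N} y_{n} \log \hat{p}_{n}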

where N is the total number of action classes.

Next comes the scene adversarial loss (L_{Adv}), which penalizes the model if it can infer the scene type from the learned representation:
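
Concretely it is a cross-entropy over scene labels (same sketch notation, with z_{m} the one-hot scene label and \hat{q}_{m} the predicted scene probability):

L_{Adv} = -\sum_{m=1}^{M} z_{m} \log \hat{q}_{m}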

where M is the total number of scene classes. The larger this loss, the less certain the model is about the scene type; the smaller it is, the more certain. So we want this loss to be as large as possible.

Last is the human mask confusion loss (L_{Ent}), which penalizes the model if it can still predict the action when the humans in the video are masked out; in other words, when the input is a human-masked-out video, we want to maximize the entropy of the predicted action distribution:
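
That is the Shannon entropy of the action prediction (again my sketch notation):

L_{Ent} = H(\hat{p}) = -\sum_{n=1}^{N} \hat{p}_{n} \log \hat{p}_{n}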

The larger this loss, the less the model can tell the action class of the human-masked-out video, which is exactly the behavior we want: attending to the action itself.

The overall optimization then goes as follows:

[Image 3: the overall optimization equations]

The first formula: we want L_{Action} (the action classification loss) to be as small as possible and L_{Adv} (the scene classification loss) to be as large as possible, hence the −λ in front of L_{Adv}; the overall L is then to be minimized.

The second formula: with the scene classifier parameters \theta_{S}^{*} fixed, find the feature extractor parameters \theta_{f} and action classifier parameters \theta_{A} that minimize the overall L.

The third formula (I was a bit unclear on this at first; a labmate explained it): it is somewhat like the Discriminator in a GAN, and the main point is to end up with a feature extractor \theta_{f} whose features the scene classifier cannot tell apart.

The paper mentions using a GRL (gradient reversal layer) to carry out this adversarial training, which I still need to study more closely.
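
For reference, a minimal GRL sketch in PyTorch (my own illustration, not necessarily line-for-line what the SDN repo does): the layer is the identity in the forward pass and multiplies the gradient by −λ in the backward pass, which would explain how the −λ from the objective can turn into a plain "+ loss_place" in the training code below.

    import torch

    class GradReverse(torch.autograd.Function):
        """Identity in the forward pass; scales the gradient by -lambd going back."""
        @staticmethod
        def forward(ctx, x, lambd):
            ctx.lambd = lambd
            return x.view_as(x)

        @staticmethod
        def backward(ctx, grad_output):
            # Reverse (and scale) the gradient flowing back into the feature extractor
            return grad_output.neg() * ctx.lambd, None

    def grad_reverse(x, lambd=1.0):
        return GradReverse.apply(x, lambd)

    # Usage sketch: the scene head sees reversed features, so training the
    # scene classifier normally pushes the feature extractor to *remove*
    # scene information.
    # scene_logits = scene_classifier(grad_reverse(features, lambd))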

When the input is a human-masked-out video, the optimization goes as follows:
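
Roughly (my paraphrase of the alternating objectives, with ω as a placeholder for the entropy-loss weight): on an unmasked clip the network minimizes the combined loss, and on a human-masked clip the feature extractor maximizes the prediction entropy, i.e. minimizes its negative:

\min_{\theta_{f},\,\theta_{A}} \; L_{Action} - \lambda\, L_{Adv} \quad \text{(unmasked clip)}

\min_{\theta_{f}} \; -\,\omega\, H\big(\hat{p}(a \mid x_{masked})\big) \quad \text{(human-masked clip)}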

In each iteration, (5) and (6) are optimized alternately, which shows up in the code as follows:

        # ------------------------------------------------
        #  Train using Action CE loss and Place ADV loss
        # ------------------------------------------------
        if opt.model == 'vgg':
            inputs_unmasked = inputs_unmasked.squeeze()    
        inputs_unmasked = Variable(inputs_unmasked)
        targets_unmasked = Variable(targets_unmasked)

        if opt.is_mask_entropy:
            if opt.is_place_adv:
                if opt.is_mask_adv:
                    outputs_unmasked, outputs_places_unmasked, outputs_rev_unmasked = model(inputs_unmasked)
                else:
                    outputs_unmasked, outputs_places_unmasked = model(inputs_unmasked)
            else:
                if opt.is_mask_adv:
                    outputs_unmasked, outputs_rev_unmasked = model(inputs_unmasked)
                else:
                    outputs_unmasked = model(inputs_unmasked)
        elif opt.is_mask_cross_entropy:
            if opt.is_place_adv:
                outputs_unmasked, outputs_places_unmasked, outputs_rev_unmasked = model(inputs_unmasked)
            else:
                outputs_unmasked, outputs_rev_unmasked = model(inputs_unmasked)

        loss_act = criterions['action_cross_entropy'](outputs_unmasked, targets_unmasked)
        if opt.is_place_adv:
            loss_place = criterions['places_cross_entropy'](outputs_places_unmasked, places_unmasked)
            if opt.is_place_entropy and epoch>=opt.warm_up_epochs:
                loss_place_entropy = criterions['places_entropy'](outputs_places_unmasked)
                loss = loss_act + loss_place + opt.weight_entropy_loss*loss_place_entropy
            else:
                loss = loss_act + loss_place  # why "+ loss_place"? presumably the GRL inside the model flips the sign, making the -lambda implicit (see Question 2)
        else:
            if opt.is_place_entropy and epoch>=opt.warm_up_epochs:
                loss_place_entropy = criterions['places_entropy'](outputs_places_unmasked)
                loss = loss_act + opt.weight_entropy_loss*loss_place_entropy
            else:
                loss = loss_act

        if torch_version < 0.4:
            act_acc = calculate_accuracy(outputs_unmasked, targets_unmasked)
            losses.update(loss.data[0], inputs_unmasked.size(0))
        else:
            act_acc = calculate_accuracy_pt_0_4(outputs_unmasked, targets_unmasked)
            if opt.is_place_adv:
                if opt.is_place_soft:                
                    _, places_hard_target = torch.max(places_unmasked, 1)
                    place_acc = calculate_accuracy_pt_0_4(outputs_places_unmasked, places_hard_target)
                else:
                    place_acc = calculate_accuracy_pt_0_4(outputs_places_unmasked, places_unmasked)            
                place_losses.update(loss_place.item(), inputs_unmasked.size(0))
            else:                
                place_losses.update(0, inputs_unmasked.size(0))
                            
            losses.update(loss.item(), inputs_unmasked.size(0))
            act_losses.update(loss_act.item(), inputs_unmasked.size(0))            
            if opt.is_place_entropy and epoch>=opt.warm_up_epochs:
                place_entropy_losses.update(loss_place_entropy.item(), inputs_unmasked.size(0))
        act_accuracies.update(act_acc, inputs_unmasked.size(0))
        if opt.is_place_adv:
            place_accuracies.update(place_acc, inputs_unmasked.size(0))
        else:
            place_accuracies.update(0, inputs_unmasked.size(0))

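        # Alternating update 1: backprop the unmasked-clip loss (action CE + scene terms)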
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # -------------------------------------------------
        #  Train using Mask Action Entropy loss (maximize)
        # -------------------------------------------------
        print('num of actual masking_inds = {}/{}'.format(torch.sum(maskings), maskings.shape[0]))

        if opt.model == 'vgg':
            inputs_masked = inputs_masked.squeeze()            
        inputs_masked = Variable(inputs_masked)
        targets_masked = Variable(targets_masked)
        
        if opt.is_mask_entropy:
            if opt.is_place_adv:
                if opt.is_mask_adv:
                    outputs_masked, outputs_places_masked, outputs_rev_masked = model(inputs_masked)
                else:
                    outputs_masked, outputs_places_masked = model(inputs_masked)
            else:
                if opt.is_mask_adv:
                    outputs_masked, outputs_rev_masked = model(inputs_masked)
                else:
                    outputs_masked = model(inputs_masked)
        elif opt.is_mask_cross_entropy:
            if opt.is_place_adv:
                outputs_masked, outputs_places_masked, outputs_rev_masked = model(inputs_masked)
            else:
                outputs_masked, outputs_rev_masked = model(inputs_masked)

        if opt.is_mask_entropy:
            if opt.is_mask_adv:
                loss_action_entropy = criterions['mask_criterion'](outputs_rev_masked)
            else:
                loss_action_entropy = criterions['mask_criterion'](outputs_masked)
            msk_loss = loss_action_entropy
        elif opt.is_mask_cross_entropy:
            loss_action_cross_entropy = criterions['mask_criterion'](outputs_rev_masked, targets_masked)
            msk_loss = loss_action_cross_entropy

        if torch_version < 0.4:
            if opt.is_mask_adv:
                msk_act_acc = calculate_accuracy(outputs_rev_masked, targets_masked)
            else:
                msk_act_acc = calculate_accuracy(outputs_masked, targets_masked)
        else:
            if opt.is_mask_adv:
                msk_act_acc = calculate_accuracy_pt_0_4(outputs_rev_masked, targets_masked)    
            else:
                msk_act_acc = calculate_accuracy_pt_0_4(outputs_masked, targets_masked)                  
        msk_act_losses.update(msk_loss.item(), inputs_masked.size(0))
        msk_act_accuracies.update(msk_act_acc, inputs_masked.size(0))        

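        # Alternating update 2: backprop the confusion loss on the human-masked clip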
        optimizer.zero_grad()
        msk_loss.backward()
        optimizer.step()

Result

[Image 4]

[Image 5: relative improvement from debiasing vs. scene representation bias]

I did not understand this figure at first: why is it that the larger the scene representation bias, the smaller the relative improvement after debiasing?

Reading on, it became clear:

The Pearson correlation is ρ = −0.896 with a p-value 0.006, highlighting a strong negative correlation between the relative performance improvement and the scene representation bias.

Our results show that if a model is pre-trained with debiasing, the model generalizes better to the datasets with less scene bias, as the model pays attention to the actual action.

In contrast, if a model is pre-trained on a dataset with a significant scene bias e.g., Kinetics without any debiasing, the model would be biased towards certain scene context. Such a model may still work well on target dataset with strong scene biases (e.g., UCF-101), but does not generalize well to other less biased target datasets (e.g., Diving48 and HMDB-51).

In other words:

If the model is pre-trained on a dataset with significant scene bias (e.g., Kinetics) without any debiasing, it ends up biased towards certain scenes. Fine-tuning it on another dataset with a similarly large scene bias (e.g., UCF-101) can still give good results (the learned scene bias carries over), but fine-tuning on a dataset with little scene bias (e.g., Diving48) works poorly, because there may be no similar scenes to lean on.

If we instead pre-train with debiasing, then fine-tuning on a dataset with little scene bias (e.g., Diving48) gives a big boost: the features are already debiased, so never having seen a similar scene no longer matters, and the relative performance improvement between models trained without and with the proposed debiasing method is comparatively large. Fine-tuning on a strongly scene-biased dataset (e.g., UCF-101), the gain is smaller: performance was already high without debiasing, so debiasing does not add much, and the relative performance improvement between models trained without and with the proposed debiasing method is comparatively small.

So overall, the debiased model generalizes better.

[Image 6: CAM visualizations]

CAM visualization results.
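
CAM (class activation mapping) projects a class's classifier weights back onto the last conv feature maps. A minimal sketch of the idea (my illustration, not the paper's visualization code; "features" and "fc_weights" are assumed inputs):

    import torch

    def compute_cam(features, fc_weights, class_idx):
        """Class Activation Map.
        features:   (C, H, W) activations from the last conv layer
        fc_weights: (num_classes, C) weights of the final linear layer
        Returns an (H, W) evidence map normalized to [0, 1].
        """
        # Weight each feature map by the target class's classifier weight
        cam = torch.einsum('c,chw->hw', fc_weights[class_idx], features)
        cam = cam - cam.min()
        cam = cam / (cam.max() + 1e-8)
        return cam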

Questions (Things To Do)

1. How does the argmax in the optimization objective show up in the code?

2. The −λ in the total loss formula does not show up in the code, especially the minus sign. Why?

3. Dig deeper into the GRL (gradient reversal layer).

4. CAM.
