MUREN(Relational Context Learning for Human-Object Interaction Detection)

Contributions

  1. A multiplex relation embedding module that generates context information from the unary, pairwise, and ternary relations within an HOI instance.
  2. An attentive fusion module that propagates the requisite context information between branches for context exchange.
  3. A three-branch architecture that learns more discriminative features for the sub-tasks (human detection, object detection, and interaction classification).
  4. MUREN outperforms state-of-the-art methods on the HICO-DET and V-COCO benchmarks.

Introduction

  • single-branch (a single transformer decoder)

    • updates a token set through a single transformer decoder
    • detects HOI instances directly with the subsequent FFNs
    • disadvantages:
      • a single transformer decoder is responsible for all sub-tasks (human detection, object detection, and interaction classification)
      • limited in adapting to the different sub-tasks simultaneously via multi-task learning
  • two-branch (two separate transformer decoders)

    • one detects human-object pairs
    • the other classifies interactions
    • disadvantages:
      • insufficient context exchange between the branches prevents two-branch methods [15,38,40] from learning relational contexts, which play a crucial role in identifying HOI instances
      • some methods add extra context exchange to tackle this issue, but they are limited to propagating human-object context to the interaction branch
  • MUREN

    • advantages:
      • performs rich context exchange
        • three types of relation context: the unary, pairwise, and ternary relations among the human, object, and interaction tokens
          • unary and pairwise relation contexts provide more fine-grained information
            • unary context: e.g., "riding" helps infer the human-interaction pair
            • pairwise context (human + "riding") helps detect the object (a bicycle)
            • the multiplex relation embedding module constructs the context information from these three relation contexts
          • ternary context provides holistic information about the HOI instance

Summary

MUREN, a current SOTA model for static-image HOI detection, addresses the shortcomings of previous one- and two-branch decoders by proposing a three-branch architecture (human detection, object detection, and interaction classification).

  1. Drawback of single-branch: a single transformer decoder must handle multiple sub-tasks, so multi-task learning performs poorly.
  2. Drawback of two-branch: two separate transformer decoders detect human-object pairs and classify interactions respectively; insufficient context exchange between the two branches prevents the model from learning relational context.

[Figure 1: MUREN overall architecture]

MUREN Model Architecture

  1. A CNN first extracts features, which are positionally encoded, flattened, and passed to a transformer encoder to produce image tokens.
  2. Three branch decoders extract task-specific tokens to predict their respective sub-tasks.
  3. MURE takes the task-specific tokens as input and generates the multiplex relation context for relational reasoning.
  4. The attentive fusion module propagates the multiplex relation context to each sub-task, performing context exchange.
  5. The outputs of the last layer of each branch are used to predict the HOI instances.
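The five steps above can be sketched end to end as a toy forward pass. This is a minimal stand-in, not the released code: the dimensions, number of layers, and the simple linear stand-ins for MURE and attentive fusion are all assumptions for illustration.

```python
import torch
import torch.nn as nn

d, nq, layers = 256, 16, 2
enc_tokens = torch.randn(1, 100, d)                    # step 1: backbone + encoder output
branches = nn.ModuleList([nn.TransformerDecoderLayer(d, 8, batch_first=True)
                          for _ in range(3 * layers)]) # step 2: three branch decoders
mure = nn.Linear(3 * d, d)                             # step 3: toy relation-context mixer
fuse = nn.Linear(2 * d, d)                             # step 4: toy attentive fusion
tokens = [torch.zeros(1, nq, d) for _ in range(3)]     # human / object / interaction queries

for l in range(layers):
    # step 2: each branch updates its own task-specific tokens
    tokens = [branches[3 * l + b](tokens[b], enc_tokens) for b in range(3)]
    # step 3: build one multiplex relation context from all three branches
    ctx = mure(torch.cat(tokens, dim=-1))
    # step 4: propagate the context back into every branch (context exchange)
    tokens = [fuse(torch.cat([t, ctx], dim=-1)) for t in tokens]

h_out, o_out, i_out = tokens                           # step 5: last-layer outputs feed prediction heads
```

The key design point visible even in this sketch is that context exchange happens after every decoder layer, so each branch stays specialized while still seeing the relational context of the other two.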

[Figure 2: MURE architecture]

MURE Architecture

Because the task-specific tokens are each generated by an independent branch, they lack relational context. To address this, the Multiplex Relation Embedding module (MURE) generates multiplex relation context for relational reasoning, comprising unary, pairwise, and ternary relation contexts. Its inputs are the i-th layer's task-specific tokens and the image tokens; its output is passed to the attentive fusion module for context exchange.
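A minimal sketch of the idea: build unary, pairwise, and ternary relation contexts from the three branches' tokens and combine them into one multiplex relation context. The layer sizes, the sum-based fusion, and the omission of cross-attention to the image tokens are simplifying assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class MultiplexRelationEmbedding(nn.Module):
    """Toy MURE: combines unary, pairwise, and ternary relation
    contexts of the human/object/interaction tokens."""
    def __init__(self, d=256):
        super().__init__()
        self.unary = nn.Linear(d, d)            # per-token context
        self.pairwise = nn.Linear(2 * d, d)     # context of each token pair
        self.ternary = nn.Linear(3 * d, d)      # holistic context of the triplet

    def forward(self, h, o, i):
        # h, o, i: (num_queries, d) task-specific tokens from the three branches
        u = self.unary(h) + self.unary(o) + self.unary(i)
        p = (self.pairwise(torch.cat([h, o], -1))
             + self.pairwise(torch.cat([h, i], -1))
             + self.pairwise(torch.cat([o, i], -1)))
        t = self.ternary(torch.cat([h, o, i], -1))
        return u + p + t                        # multiplex relation context, (num_queries, d)
```

The unary and pairwise terms carry the fine-grained cues described in the introduction (e.g., human + "riding" hints at a bicycle), while the ternary term captures the triplet as a whole.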

Attentive Fusion

Goal: propagate the multiplex relation context to the task-specific tokens.

Since each sub-task requires different context information for relational reasoning, an MLP (conditioned on each task-specific token) transforms the multiplex relation context, so that the propagated context is tailored to each sub-task. Channel attention then selects the context each sub-task needs. The formulas are as follows:

[Image: attentive fusion equations]
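In code, the fusion step described above might look like the following toy sketch: an MLP conditions the multiplex relation context on one branch's tokens, and a sigmoid channel attention gates which channels that sub-task receives. The dimensions and the exact gating form are assumptions, not the paper's equations.

```python
import torch
import torch.nn as nn

d = 256
task_mlp = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, d))
gate = nn.Sequential(nn.Linear(d, d), nn.Sigmoid())      # channel attention

task_tok = torch.randn(16, d)   # one branch's task-specific tokens
rel_ctx = torch.randn(16, d)    # multiplex relation context from MURE
cond = task_mlp(torch.cat([task_tok, rel_ctx], dim=-1))  # context conditioned on the task
fused = task_tok + gate(cond) * cond                     # residual update with selected channels
```

Each of the three branches would apply its own copy of this fusion, so every sub-task selects only the context channels it needs.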

The formulas for the final HOI instance prediction:

[Figure 3: final HOI prediction formulas]
