Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models

A CVPR 2018 paper on cross-modal retrieval. Paper link: https://arxiv.org/abs/1711.06420. The first author is a PhD student at Nanyang Technological University (homepage: http://jxgu.cc/), and the code has been released at https://github.com/ujiuxiang/NLP_Practice.PyTorch/tree/master/cross_modal_retrieval.
Personal rambling: three reasons I read this paper.

  • 1. This is the first paper I have seen that uses both GAN and reinforcement learning (RL) for cross-modal retrieval.
  • 2. This remarkable network can handle three cross-modal tasks at once: cross-media retrieval, image captioning, and text-to-image synthesis (for the latter two tasks, the paper only shows qualitative visualizations, with no quantitative analysis).
  • 3. The paper was accepted to CVPR 2018 as a Spotlight, and its cross-media retrieval performance on MSCOCO is state-of-the-art.

What the paper does (cross-media retrieval):
Input: an image (or a sentence) + a dataset      Output: a ranked list of sentences (or images)
The qualitative results shown in the paper are as follows.
[Figure: qualitative retrieval results]
The comparison with state-of-the-art methods is shown below.
[Figure: comparison with state-of-the-art methods]
The ablation study in the paper is shown below.
[Figure: ablation study]

Method
The framework of the paper is shown below.
[Figure: framework overview]
The method consists of three parts: multi-modal feature embedding (the entire upper part), image-to-text generative feature learning (the blue path), and text-to-image generative adversarial feature learning (the green path).

multi-modal feature embedding:
image encoding: a CNN pre-trained on ImageNet.
sentence encoding: each word is first represented as a one-hot vector, mapped to a word embedding by an embedding matrix, and then encoded with a two-layer bidirectional GRU. (Both encoders produce high-level abstract features as well as detailed grounded features, which feed the two branches below.)
feature embedding loss: a two-branch ranking loss with an order-violation penalty [ https://arxiv.org/abs/1511.06361 ], applied separately to the high-level abstract features and the detailed grounded features. A sketch of this loss is given below.
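Below is a minimal PyTorch sketch of a bidirectional ranking loss using the order-violation similarity from order embeddings (arXiv:1511.06361). It is written from the description above rather than from the released code; the margin value, the direction of the ordering, and summing over all in-batch negatives are assumptions made for illustration.

```python
import torch

def order_violation_sim(im, s):
    # Order-embedding similarity (arXiv:1511.06361): penalize coordinates
    # where the sentence embedding exceeds the image embedding, so matched
    # pairs respect the assumed partial order.
    # im: (B_im, D) image embeddings, s: (B_s, D) sentence embeddings
    viol = torch.clamp(s.unsqueeze(0) - im.unsqueeze(1), min=0)  # (B_im, B_s, D)
    return -viol.pow(2).sum(dim=2)                               # higher = more similar

def ranking_loss(im, s, margin=0.05):                            # margin value is illustrative
    # Bidirectional hinge ranking loss over all in-batch negatives.
    scores = order_violation_sim(im, s)                          # (B, B), matched pairs on the diagonal
    pos = scores.diag().view(-1, 1)
    cost_s = torch.clamp(margin + scores - pos, min=0)           # image anchor vs. negative sentences
    cost_im = torch.clamp(margin + scores - pos.t(), min=0)      # sentence anchor vs. negative images
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    return cost_s.masked_fill(mask, 0).sum() + cost_im.masked_fill(mask, 0).sum()
```

In the two-branch setting described above, this loss would be applied once to the abstract-feature pair and once to the grounded-feature pair, and the two terms summed.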

image-to-text generative feature learning:
An image is first encoded by the CNN into detailed grounded features, which an RNN then decodes into a sentence; the loss function pushes the decoded sentence to be as similar as possible to the sentence paired with the image.
loss function: cross-entropy (XE) loss + reinforcement learning (RL) loss [ https://arxiv.org/abs/1612.00563 ] (see the sketch after the list below)

  • XE: a word-level cost. At each step the decoder predicts the next word, and the probability of the ground-truth word is made as large as possible.
  • RL: a sentence-level cost. The decoded sentence is scored against the ground-truth sentence with a metric (BLEU or CIDEr), and that score is used as the reward.
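A minimal PyTorch sketch of these two terms, assuming the referenced RL loss is the self-critical baseline of arXiv:1612.00563; the metric scores (e.g. CIDEr) are assumed to be computed externally, and the function names are illustrative.

```python
import torch
import torch.nn.functional as F

def xe_loss(logits, targets, pad_idx=0):
    # Word-level cross-entropy: maximize the probability of each
    # ground-truth word under the decoder's predictive distribution.
    # logits: (B, T, V), targets: (B, T); pad_idx is illustrative.
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1), ignore_index=pad_idx)

def rl_loss(sample_logprob, sample_score, greedy_score):
    # Sentence-level self-critical RL loss: the sampled caption's metric
    # score (BLEU/CIDEr), baselined by the greedy caption's score, weights
    # the log-probability of the sampled sentence.
    # sample_logprob: (B,) summed log-probs of the sampled sentence
    # sample_score, greedy_score: (B,) metric scores
    advantage = (sample_score - greedy_score).detach()
    return -(advantage * sample_logprob).mean()
```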

text-to-image generative adversarial feature learning:
An RNN first encodes the sentence into detailed grounded features. Conditioning augmentation [ https://arxiv.org/abs/1612.03242 ] then compresses these features into a lower-dimensional condition, augmenting the data and enforcing smoothness. The augmented condition is concatenated with a noise vector, and the concatenated vector is fed into a text-to-image synthesis model [ https://arxiv.org/abs/1605.05396 ] to generate an image. During training, a Kullback-Leibler (KL) divergence term is added on the generator side to further enforce smoothness and avoid overfitting.
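A minimal sketch of conditioning augmentation with the KL regularizer, following the StackGAN formulation referenced above; the dimensions, module name, and noise size are illustrative and not taken from the released code.

```python
import torch
import torch.nn as nn

class ConditioningAugmentation(nn.Module):
    # Compress the sentence feature into a diagonal Gaussian, sample a
    # low-dimensional condition from it, and return a KL term towards
    # N(0, I) that enforces smoothness of the conditioning manifold.
    def __init__(self, text_dim=1024, cond_dim=128):   # dims are illustrative
        super().__init__()
        self.fc = nn.Linear(text_dim, cond_dim * 2)

    def forward(self, text_feat):
        mu, logvar = self.fc(text_feat).chunk(2, dim=1)
        std = torch.exp(0.5 * logvar)
        c = mu + std * torch.randn_like(std)            # reparameterized sample
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return c, kl

# Usage sketch: concatenate the condition with a noise vector z and feed
# the result to the conditional generator; add kl to the generator loss.
ca = ConditioningAugmentation()
text_feat = torch.randn(4, 1024)                        # stand-in for RNN sentence features
c, kl = ca(text_feat)
gen_input = torch.cat([c, torch.randn(4, 100)], dim=1)  # 100-dim noise z, illustrative
```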

framework training:
Training proceeds by first training the image-to-text generative feature learning and the text-to-image generative adversarial feature learning (discriminator first, then generator), and then training the multi-modal feature embedding.
The pseudocode given in the paper is shown below.
[Figure: training pseudocode]

Summary:

  • In essence, the paper is a ranking loss plus a bidirectional cross-modal autoencoder [ https://people.cs.clemson.edu/~jzwang/1501863/mm2014/p7-feng.pdf ]; the ranking loss still does most of the work, while the bidirectional cross-modal autoencoder helps learn better common-space features.
  • The GAN component comes from text-to-image synthesis and the reinforcement learning (RL) component essentially comes from image captioning; GAN and RL are used to learn additional features and do not directly strengthen the ranking loss itself.
  • The ranking loss with the order-violation penalty beats the hard-triplet loss (VSE++) on some metrics (it is unclear whether the settings are identical or whether tuning tricks differ) and is comparable on others, so for retrieval the hard-triplet loss is not necessarily the best choice.
