Context-Aware Visual Policy Network for Fine-Grained Image Captioning
Visualization; Task analysis; Cognition; Decision making; Training; Natural languages; Reinforcement learning; Image captioning; Visual context; Policy network
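Machine learning; Computer vision; Reinforcement learning
Reinforcement learning; Policy optimization; Actor-Critic; Fine-grained vision; Age estimation; Visual (video) captioning; Attention mechanism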
With the maturity of visual detection techniques, we are more ambitious in describing visual content with open-vocabulary, fine-grained and free-form language, i.e., the task of image captioning.
In particular, we are interested in generating longer, richer and more fine-grained sentences and paragraphs as image descriptions.
Image captioning can be cast as the task of sequential language prediction given visual content, where the output sequence forms a natural language description with plausible grammar.
However, existing image captioning methods focus only on language policy and not on visual policy, and thus fail to capture the visual context that is crucial for compositional reasoning, such as object relationships (e.g., “man riding horse”) and visual comparisons (e.g., “small(er) cat”).
This issue is especially severe when generating longer sequences such as a paragraph.
To fill the gap, we propose a Context-Aware Visual Policy network (CAVP) for fine-grained image-to-language generation: image sentence captioning and image paragraph captioning.
During captioning, CAVP explicitly considers the previous visual attentions as context, and decides whether the context is used for the current word/sentence generation given the current visual attention.
Compared with the traditional visual attention mechanism, which fixes on only a single visual region at each step, CAVP can attend to complex visual compositions over time.
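The idea of reusing earlier attention as context can be illustrated with a minimal sketch. This is not the CAVP architecture itself, whose details are not given here. It assumes that the context is the mean of previously attended features and that a sigmoid gate decides how much of that context to mix into the current attention. All function names (`attend`, `context_gate`, `context_aware_step`) are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, regions):
    """Standard soft attention: a weighted average of region features."""
    weights = softmax([dot(query, r) for r in regions])
    dim = len(regions[0])
    return [sum(w * r[i] for w, r in zip(weights, regions)) for i in range(dim)]


def context_gate(current, context):
    """Hypothetical gate in (0, 1): how much of the context to use."""
    return 1.0 / (1.0 + math.exp(-dot(current, context)))


def context_aware_step(query, regions, history):
    """Attend to the image, then optionally mix in previous attentions.

    `history` holds the attended features of earlier steps and is
    extended in place. Returns (fused_feature, gate_value).
    """
    current = attend(query, regions)
    if not history:
        fused, gate = list(current), 0.0  # no context at the first step
    else:
        dim = len(current)
        # Assumption: the context is the mean of earlier attended features.
        context = [sum(h[i] for h in history) / len(history) for i in range(dim)]
        gate = context_gate(current, context)
        fused = [c + gate * k for c, k in zip(current, context)]
    history.append(current)
    return fused, gate


# Usage: three toy region features and two decoding steps.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
history = []
for query in ([2.0, 0.0], [0.0, 2.0]):
    fused, gate = context_aware_step(query, regions, history)
    print(fused, gate)
```

With a plain attention step, the second word would see only the regions selected by the second query. Here it also sees a gated summary of what was attended for the first word, which is the kind of information needed for relationships such as "man riding horse".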
The whole image captioning model—CAVP and its subsequent language policy network—can be efficiently optimized end-to-end by using an actor-critic policy gradient method.
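To make the actor-critic policy gradient idea concrete, here is a toy sketch rather than the training procedure used for CAVP. It assumes a single decision over three candidate words, a hypothetical reward of 1 for the correct word (standing in for a caption-quality score), a softmax actor, and a critic that is just a running estimate of the expected reward. The name `train_actor_critic` and all hyperparameters are illustrative.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_actor_critic(rewards, steps=2000, lr_actor=0.1, lr_critic=0.1, seed=0):
    """Train a softmax policy (actor) with a scalar value baseline (critic).

    rewards[i] is the reward for choosing action i (toy, hypothetical values).
    Returns the final action probabilities and the critic's value estimate.
    """
    rng = random.Random(seed)
    logits = [0.0] * len(rewards)  # actor parameters
    value = 0.0                    # critic: estimate of the expected reward
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        reward = rewards[action]
        # Advantage: how much better the outcome was than the critic expected.
        advantage = reward - value
        # Policy gradient for a softmax policy:
        # d log pi(action) / d logit_i = 1[i == action] - probs[i]
        for i in range(len(logits)):
            grad_log_pi = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr_actor * advantage * grad_log_pi
        # Critic update: move the value estimate toward the observed reward.
        value += lr_critic * (reward - value)
    return softmax(logits), value


# Usage: only the third word earns a reward.
probs, value = train_actor_critic([0.0, 0.0, 1.0])
print(probs, value)
```

The critic's estimate serves as a baseline that reduces the variance of the gradient. In a full captioning model the same advantage signal would be backpropagated through both the language policy and the visual policy, which is what end-to-end optimization refers to.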
We have demonstrated the effectiveness of CAVP through state-of-the-art performance on the MS-COCO and Stanford captioning datasets, using various metrics and sensible visualizations of qualitative visual context…
Authors: Zheng-Jun Zha, Daqing Liu, Hanwang Zhang, Yongdong Zhang, Feng Wu