【refer】Predicting Human Eye Fixations via an LSTM-based Saliency Attentive Model

【Introduction】

【Related Work】

【Model Architecture】


[Figure 1: Overview of the model architecture]

A. Attentive Convolutional LSTM

Long Short-Term Memory (LSTM) networks are well suited to problems with strong temporal dependencies. In this saliency model, the dot products inside the LSTM are replaced with convolutional operations. The network works by updating its internal state sequentially through three sigmoid gates, and, as in a CNN, each update attends to a different region of the image. At every step the input Xt and the previous hidden state H(t-1) are each convolved, their sum is passed through a tanh activation function, and the result is convolved with one more kernel. A softmax over this output yields the elements of an attention map, one per spatial location, and the element-wise product of this map with X gives the input X̃ that is fed to the LSTM layer. The result is a progressively refined saliency prediction: region weights change and salient regions appear and disappear, and over these iterations the prediction becomes increasingly accurate.

This stage uses 512 channels.
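The attention step above can be sketched in a few lines. This is a simplified illustration, not the paper's implementation: it uses 1×1 convolutions (per-pixel matrix multiplications) instead of the model's larger convolutional kernels and 512 channels, and the weight names (`Wa`, `Ua`, `v`) are placeholders.

```python
import numpy as np

def spatial_softmax(z):
    """Softmax over all spatial locations of a 2D score map."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attentive_gate(X, H_prev, Wa, Ua, v):
    """One attention step of an attentive ConvLSTM (simplified sketch).

    X, H_prev: (C, H, W) feature maps.
    Wa, Ua:    (C, C) weights of 1x1 convolutions (placeholders).
    v:         (C,) weight of a 1x1 convolution producing one channel.
    Returns the attention map A of shape (H, W) and the reweighted input.
    """
    # A 1x1 convolution is a per-pixel matrix multiplication over channels.
    Z = np.tanh(np.tensordot(Wa, X, axes=1) + np.tensordot(Ua, H_prev, axes=1))
    scores = np.tensordot(v, Z, axes=1)                  # (H, W) score map
    A = spatial_softmax(scores)                          # attention map, sums to 1
    X_tilde = A[None, :, :] * X                          # element-wise product, broadcast over channels
    return A, X_tilde
```

Because the attention map sums to 1 over all spatial locations, multiplying it into `X` reweights the feature map toward the currently attended region.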

B. Learned Priors

Psychological studies report that human fixations are biased toward the center. This phenomenon is mainly due to the tendency of photographers to position objects of interest at the center of the image. Indeed, when there are no highly salient regions, humans are inclined to look at the center of the image.

To model this center bias, the model adds learned center priors: 2D Gaussian distributions with diagonal covariance matrices, whose means and variances are learned.

This stage contributes 16 channels, bringing the total to 528 channels.
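A single learned prior can be rendered as follows. This is a minimal NumPy sketch, assuming normalized [0, 1] image coordinates; the parameter names are illustrative, not taken from the paper's code.

```python
import numpy as np

def gaussian_prior_map(mu_x, mu_y, sigma_x, sigma_y, h, w):
    """Render one center prior as a 2D Gaussian with diagonal covariance.

    mu_x, mu_y, sigma_x, sigma_y are in normalized [0, 1] coordinates;
    in the model these would be learned parameters.
    Returns an (h, w) map with its peak at (mu_y, mu_x).
    """
    ys = np.linspace(0.0, 1.0, h)[:, None]   # (h, 1) row coordinates
    xs = np.linspace(0.0, 1.0, w)[None, :]   # (1, w) column coordinates
    return np.exp(-((xs - mu_x) ** 2 / (2 * sigma_x ** 2)
                    + (ys - mu_y) ** 2 / (2 * sigma_y ** 2)))
```

In the model, 16 such maps (one per learned mean/variance pair) are generated and concatenated to the 512 feature channels.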

C. Dilated Convolutional Network

The downsampling a CNN performs while extracting features hurts prediction accuracy, so the new model increases the resolution of the CNN output by reducing the stride and adding dilation. The paper considers two backbone models:

[VGG-16] 13 convolutional layers and 3 fully connected layers. The convolutional layers are divided into five convolutional blocks, each followed by a max-pooling layer with a stride of 2.

[ResNet-50] five convolutional blocks and a fully connected layer.

The blocks are built as a series of residual mappings, each composed of a few stacked layers. These residual connections both avoid the drop in accuracy that comes with growing depth and strengthen the network's ability to extract features. Since the goal of the network here is to produce feature maps:

[VGG-16] the last max-pooling layer is removed.

[ResNet-50] the stride is removed and dilated convolutions are introduced in the last two blocks.

The output at this point has 2048 channels. To reduce the number of feature maps, it is first passed through a convolutional layer with 512 filters.

The prior stage is followed by the same kind of 5×5 convolution, and a final 1×1 convolution produces the saliency map.
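Dilation itself is simple to illustrate: the kernel taps are spread apart by the dilation factor, so the receptive field grows without any downsampling. Below is a minimal NumPy sketch of a "valid" dilated 2D convolution (cross-correlation, as deep-learning frameworks implement it), not the backbone's actual code.

```python
import numpy as np

def dilated_conv2d(x, k, dilation=1):
    """'Valid' 2D cross-correlation with a dilated kernel.

    Dilation inserts (dilation - 1) gaps between kernel taps, enlarging the
    receptive field without downsampling -- the trick applied in the last
    blocks of the modified backbones.
    """
    kh, kw = k.shape
    eff_h = dilation * (kh - 1) + 1          # effective kernel height
    eff_w = dilation * (kw - 1) + 1          # effective kernel width
    H, W = x.shape
    out = np.zeros((H - eff_h + 1, W - eff_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Sample the input with the dilation step, then weight and sum.
            patch = x[i:i + eff_h:dilation, j:j + eff_w:dilation]
            out[i, j] = (patch * k).sum()
    return out
```

With `dilation=2`, a 3×3 kernel covers a 5×5 area of the input while still producing a dense, stride-1 output.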

D. Loss Function

The loss is a linear combination of three metrics commonly used to evaluate saliency models: Normalized Scanpath Saliency (NSS), the Linear Correlation Coefficient (CC), and the Kullback-Leibler Divergence (KL-Div).

NSS: the idea is to quantify the saliency map values at the eye fixation locations and to normalize them with the saliency map variance.

CC: Pearson's linear correlation coefficient; it treats the saliency and ground-truth density maps as random variables and measures the linear relationship between them.

KL-Div: measures the loss of information when the predicted saliency distribution is used to approximate the ground-truth distribution.

The AUC metrics do not penalize low-valued false positives: they give a high score to high-valued predictions placed at fixated locations and ignore the others. The sAUC, in addition, is designed to penalize models that exploit the center bias present in eye fixations. The NSS, instead, is equally sensitive to false positives and false negatives.
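Each of the three terms can be written in a few lines of NumPy. This is a sketch of the standard metric definitions, not the paper's code, and the weights of the linear combination are not reproduced here.

```python
import numpy as np

EPS = 1e-8  # small constant to avoid division by zero / log of zero

def nss(sal, fixations):
    """Normalized Scanpath Saliency: mean of the standardized saliency map
    at fixated locations (fixations is a binary map)."""
    s = (sal - sal.mean()) / (sal.std() + EPS)
    return s[fixations.astype(bool)].mean()

def cc(sal, gt):
    """Pearson's linear correlation coefficient between the predicted and
    ground-truth density maps."""
    a = sal - sal.mean()
    b = gt - gt.mean()
    return (a * b).sum() / (np.sqrt((a * a).sum() * (b * b).sum()) + EPS)

def kl_div(sal, gt):
    """Kullback-Leibler divergence of the predicted distribution from the
    ground truth; both maps are normalized to sum to 1."""
    p = gt / (gt.sum() + EPS)
    q = sal / (sal.sum() + EPS)
    return (p * np.log(p / (q + EPS) + EPS)).sum()
```

A perfect prediction gives CC near 1 and KL-Div near 0, while NSS grows with how strongly the standardized map responds at the true fixation points.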

【Dataset】
