《A Model of Saliency-based Visual Attention for Rapid Scene Analysis》翻译和笔记

原文链接:A Model of Saliency-based Visual Attention for Rapid Scene Analysis

以机翻为主,人工校对。

摘要

A visual attention system, inspired by the behavior and the neuronal architecture of the early primate visual system, is presented. Multiscale image features are combined into a single topographical saliency map. A dynamical neural network then selects attended locations in order of decreasing saliency. The system breaks down the complex problem of scene understanding by rapidly selecting, in a computationally efficient manner, conspicuous locations to be analyzed in detail.

受灵长类动物早期视觉系统的行为和神经结构启发,本文提出了一种视觉注意力系统:将多尺度的图像特征组合成单一的地形式显著图,再由动态神经网络按照显著性递减的顺序依次选择被关注的位置。该系统以高效的计算方式快速选出需要详细分析的显著位置,从而分解了复杂的场景理解问题。

I. 前言

Primates have a remarkable ability to interpret complex scenes in real time, despite the limited speed of the neuronal hardware available for such tasks. Intermediate and higher visual processes appear to select a subset of the available sensory information before further processing [1], most likely to reduce the complexity of scene analysis [2]. This selection appears to be implemented in the form of a spatially circumscribed region of the visual field, the so-called "focus of attention", which scans the scene both in a rapid, bottom-up, saliency-driven and task-independent manner as well as in a slower, top-down, volition-controlled and task-dependent manner [2].

尽管可用于此类任务的神经元硬件速度有限,灵长类动物仍具有出色的实时解释复杂场景的能力。中级和高级视觉过程似乎会在进一步处理之前选择可用感官信息的一个子集[1],其目的很可能是降低场景分析的复杂性[2]。这种选择似乎是以视野中一块空间上受限的区域(即所谓的"注意焦点")的形式实现的:该区域既能以快速、自下而上、显著性驱动且与任务无关的方式扫描场景,也能以较慢、自上而下、受意志控制且依赖任务的方式扫描场景[2]。

Models of attention include “dynamic routing” models, in which information from only a small region of the visual field can progress through the cortical visual hierarchy. The attended region is selected through dynamic modifications of cortical connectivity, or through the establishment of specific temporal patterns of activity, under both top-down (task-dependent) and bottom-up (scene-dependent) control [3], [2], [1].

注意力模型包括"动态路由"模型:在这类模型中,只有来自视野中一小块区域的信息才能在皮层视觉层级中继续向上传递。被关注的区域通过动态修改皮层连接,或通过建立特定的活动时间模式来选择,这一过程同时受自上而下(依赖任务)和自下而上(依赖场景)的控制[3],[2],[1]。

The model proposed here (Fig. 1) builds on a second biologically-plausible architecture, proposed by Koch and Ullman [4] and at the basis of several models [5], [6]. It is related to the so-called "feature integration theory", proposed to explain human visual search strategies [7]. Visual input is first decomposed into a set of topographic feature maps. Different spatial locations then compete for saliency within each map, such that only locations which locally stand out from their surround can persist. All feature maps feed, in a purely bottom-up manner, into a master "saliency map", which topographically codes for local conspicuity over the entire visual scene. In primates, such a map is believed to be located in the posterior parietal cortex [8] as well as in the various visual maps in the pulvinar nuclei of the thalamus [9]. The model's saliency map is endowed with internal dynamics which generate attentional shifts. This model consequently represents a complete account for bottom-up saliency, and does not require any top-down guidance to shift attention. This framework provides a massively parallel method for the fast selection of a small number of interesting image locations to be analyzed by more complex and time-consuming object recognition processes. Extending this approach, in "guided search" feedback from higher cortical areas (e.g., knowledge about targets to be found) was used to weight the importance of different features [10], such that only those with high weights could reach higher processing levels.

图1 模型的大致结构

这里提出的模型(图1)建立在第二种生物学上合理的体系结构之上,该结构由Koch和Ullman提出[4],也是若干后续模型[5],[6]的基础。它与为解释人类视觉搜索策略而提出的所谓"特征整合理论"相关[7]。视觉输入首先被分解为一组地形式特征图。随后,不同的空间位置在每张特征图内部竞争显著性,只有在局部明显突出于周围环境的位置才能保留下来。所有特征图以纯自下而上的方式汇入一张主"显著图",该图以地形方式编码整个视觉场景中的局部显眼程度。在灵长类动物中,这样的图被认为位于后顶叶皮层[8]以及丘脑枕核的各种视觉图中[9]。模型的显著图具有内部动力学,可以产生注意力的转移。因此,该模型完整刻画了自下而上的显著性,不需要任何自上而下的引导即可转移注意力。该框架提供了一种大规模并行的方法,可以快速选出少量值得关注的图像位置,再交由更复杂、更耗时的物体识别过程做详细分析。作为对这一方法的扩展,"引导搜索"利用来自更高皮层区域的反馈(例如关于待寻找目标的知识)来加权不同特征的重要性[10],使得只有权重较高的特征才能到达更高的处理层级。

II. 模型

Input is provided in the form of static color images, usually digitized at $640 \times 480$ resolution. Nine spatial scales are created using dyadic Gaussian pyramids [11], which progressively lowpass filter and subsample the input image, yielding horizontal and vertical image reduction factors ranging from 1:1 (scale 0) to 1:256 (scale 8) in eight octaves.

输入为静态彩色图像,通常以 $640 \times 480$ 的分辨率数字化。使用并矢高斯金字塔[11]构建9个空间尺度:对输入图像逐级进行低通滤波和下采样,在8个倍频程内得到从1:1(尺度0)到1:256(尺度8)的水平与垂直图像缩减因子。
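
下面给出一个构建并矢高斯金字塔的最小示意代码(Python + OpenCV;函数名 `gaussian_pyramid` 为本文自拟,`cv2.pyrDown` 默认先做高斯平滑再做 2 倍下采样,与上文"逐级低通滤波并下采样"的描述一致;这只是帮助理解的草图,并非论文的原始实现):

```python
import cv2
import numpy as np

def gaussian_pyramid(img, num_levels=9):
    """并矢高斯金字塔:每一层先低通滤波再 2 倍下采样。
    返回列表 pyr,pyr[0] 为原图(尺度0),pyr[8] 为约 1:256 的缩小图(尺度8)。"""
    pyr = [img.astype(np.float32)]
    for _ in range(num_levels - 1):
        pyr.append(cv2.pyrDown(pyr[-1]))   # 高斯平滑 + 下采样 1/2
    return pyr

# 用法示例(假设有一张 640x480 的输入图像):
# img = cv2.imread("scene.png").astype(np.float32) / 255.0
# pyr = gaussian_pyramid(img)   # len(pyr) == 9
```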

Each feature is computed by a set of linear "center-surround" operations akin to visual receptive fields (Fig. 1): Typical visual neurons are most sensitive in a small region of the visual space (the center), while stimuli presented in a broader, weaker antagonistic region concentric with the center (the surround) inhibit the neuronal response. Such architecture, sensitive to local spatial discontinuities, is particularly well-suited to detecting locations which locally stand out from their surround, and is a general computational principle in the retina, lateral geniculate nucleus and primary visual cortex [12]. Center-surround is implemented in the model as the difference between fine and coarse scales: The center is a pixel at scale $c \in \{2,3,4\}$, and the surround is the corresponding pixel at scale $s = c + \delta$, with $\delta \in \{3,4\}$. Across-scale difference between two maps, denoted $\ominus$ below, is obtained by interpolation to the finer scale and point-by-point subtraction. Using several scales not only for $c$, but also for $\delta = s - c$, yields truly multiscale feature extraction, by including different size ratios between the center and surround regions (contrary to previously used fixed ratios [5]).

每个特征都通过一组类似于视觉感受野的线性"中心-周围"运算来计算(图1):典型的视觉神经元在视觉空间的一小块区域(中心)最为敏感,而落在与中心同心、范围更大且作用较弱的拮抗区域(周围)中的刺激则会抑制神经元的响应。这种对局部空间不连续性敏感的结构特别适合检测在局部明显突出于周围环境的位置,也是视网膜、外侧膝状体和初级视觉皮层中的一条通用计算原理[12]。中心-周围在模型中实现为细尺度与粗尺度之间的差:中心是尺度 $c \in \{2,3,4\}$ 上的一个像素,周围是尺度 $s = c + \delta$(其中 $\delta \in \{3,4\}$)上的对应像素。两张图之间的跨尺度差(下文记作 $\ominus$)通过把粗尺度图插值到细尺度并逐点相减得到。不仅对 $c$、也对 $\delta = s - c$ 使用多个取值,从而在中心区域与周围区域之间包含不同的尺寸比(与以前使用的固定比例[5]相反),实现了真正的多尺度特征提取。

A. Extraction of early visual features 提取早期视觉特征

With $r$, $g$ and $b$ being the red, green and blue channels of the input image, an intensity image $I$ is obtained as $I = (r+g+b)/3$. $I$ is used to create a Gaussian pyramid $I(\sigma)$, where $\sigma \in [0..8]$ is the scale. The $r$, $g$ and $b$ channels are normalized by $I$ in order to decouple hue from intensity. However, because hue variations are not perceivable at very low luminance (and hence are not salient), normalization is only applied at the locations where $I$ is larger than 1/10 of its maximum over the entire image (other locations yield zero $r$, $g$ and $b$). Four broadly tuned color channels are created: $R = r - (g+b)/2$ for red, $G = g - (r+b)/2$ for green, $B = b - (r+g)/2$ for blue, and $Y = (r+g)/2 - |r-g|/2 - b$ for yellow (negative values are set to zero). Four Gaussian pyramids $R(\sigma)$, $G(\sigma)$, $B(\sigma)$ and $Y(\sigma)$ are created from these color channels.

r r r g g g b b b 分别表示输入图像的红色,绿色和蓝色通道,获得的亮度图像为 I = ( r + g + b ) / 3 I =(r + g + b)/3 I=(r+g+b)/3 I I I 用于创建高斯金字塔 I ( σ ) I(\sigma) I(σ),其中 σ ∈ [ 0 , . . , 8 ] \sigma \in [0,..,8] σ[0,..,8]是尺度因子。 r r r g g g b b b 通道通过 I I I 进行归一化,以使色调与亮度脱钩。 但是,由于在非常低的亮度下无法感知色相变化(因此也不显着),因此仅在 I I I 大于整个图像最大值的1/10的位置进行归一化(其他位置产生 r r r g g g b b b)。 创建了四个扩展颜色通道: R = r − ( g + b ) / 2 R = r-(g + b)/ 2 R=r(g+b)/2 表示红色, G = g − ( r + b ) / 2 G = g-(r + b)/2 G=g(r+b)/2 表示绿色, B = b − ( r + g ) / 2 B=b-(r + g)/2 B=b(r+g)/2 代表蓝色, Y = ( r + g ) / 2 − ∣ r − g ∣ / 2 − b Y=(r + g)/2-| r-g |/2-b Y=(r+g)/2rg/2b 代表黄色(负值设置为零)。 四个高斯金字塔 R ( σ ) R(\sigma) R(σ) G ( σ ) G(\sigma) G(σ) B ( σ ) B(\sigma) B(σ) R ( σ ) R(\sigma) R(σ) 从这些颜色通道创建。

Center-surround differences (defined previously) between a "center" fine scale $c$ and a "surround" coarser scale $s$ yield the feature maps. The first set of feature maps is concerned with intensity contrast, which in mammals is detected by neurons sensitive either to dark centers on bright surrounds, or to bright centers on dark surrounds [12]. Here, both types of sensitivities are simultaneously computed (using a rectification) in a set of six maps $\mathcal{I}(c,s)$, with $c \in \{2,3,4\}$ and $s = c + \delta$, $\delta \in \{3,4\}$:

特征图由"中心"细尺度 $c$ 与"周围"粗尺度 $s$ 之间的中心-周围差(即前文定义的 $\ominus$)生成。第一组特征图与亮度对比度有关:在哺乳动物中,亮度对比度由对"亮背景上的暗中心"或"暗背景上的亮中心"敏感的神经元检测[12]。这里,两类敏感性在一组六张图 $\mathcal{I}(c,s)$ 中同时计算(使用取绝对值的整流),其中 $c \in \{2,3,4\}$,$s = c + \delta$,$\delta \in \{3,4\}$(公式1):
$$\mathcal{I}(c,s)=\left|I(c)\ominus I(s)\right| \tag{1}$$
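
跨尺度差 $\ominus$ 与公式(1)的示意实现如下(`across_scale_diff`、`intensity_feature_maps` 为自拟函数名,`I_pyr` 假定是前文示意的 `gaussian_pyramid` 的输出;论文未规定具体插值核,这里取双线性插值,属本文假设):

```python
import cv2
import numpy as np

def across_scale_diff(center_map, surround_map):
    """跨尺度差 ⊖:把粗尺度图插值到细尺度大小,再逐点相减取绝对值。"""
    h, w = center_map.shape[:2]
    surround_up = cv2.resize(surround_map, (w, h), interpolation=cv2.INTER_LINEAR)
    return np.abs(center_map - surround_up)

def intensity_feature_maps(I_pyr):
    """公式(1):6 张亮度特征图 I(c, s),c ∈ {2,3,4},s = c + δ,δ ∈ {3,4}。"""
    maps = {}
    for c in (2, 3, 4):
        for delta in (3, 4):
            s = c + delta
            maps[(c, s)] = across_scale_diff(I_pyr[c], I_pyr[s])
    return maps
```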

A second set of maps is similarly constructed for the color channels, which in cortex are represented using a so-called "color double-opponent" system: In the center of their receptive field, neurons are excited by one color (e.g., red) and inhibited by another (e.g., green), while the converse is true in the surround. Such spatial and chromatic opponency exists for the red/green, green/red, blue/yellow and yellow/blue color pairs in human primary visual cortex [13]. Accordingly, maps $\mathcal{RG}(c,s)$ are created in the model to simultaneously account for red/green and green/red double opponency (Eq. 2), and $\mathcal{BY}(c,s)$ for blue/yellow and yellow/blue double opponency (Eq. 3):

类似地,为颜色通道构建第二组特征图;在皮层中,颜色用所谓的"颜色双拮抗"系统表示:在感受野中心,神经元被一种颜色(例如红色)激发、被另一种颜色(例如绿色)抑制,而在周围区域则正好相反。在人类初级视觉皮层中,红/绿、绿/红、蓝/黄和黄/蓝颜色对都存在这样的空间与色彩拮抗[13]。因此,模型中创建了 $\mathcal{RG}(c,s)$ 图来同时刻画红/绿和绿/红双拮抗(公式2),以及 $\mathcal{BY}(c,s)$ 图来刻画蓝/黄和黄/蓝双拮抗(公式3):

$$\mathcal{RG}(c,s)=\left|\left(R(c)-G(c)\right)\ominus\left(R(s)-G(s)\right)\right| \tag{2}$$

$$\mathcal{BY}(c,s)=\left|\left(B(c)-Y(c)\right)\ominus\left(B(s)-Y(s)\right)\right| \tag{3}$$
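
公式(2)(3)的颜色拮抗特征图可以沿用前面的 `across_scale_diff` 写成如下示意代码(函数名为自拟,各参数为对应颜色通道的高斯金字塔):

```python
def color_feature_maps(R_pyr, G_pyr, B_pyr, Y_pyr):
    """公式(2)(3):红/绿与蓝/黄双拮抗特征图,各 6 张,共 12 张。"""
    rg_maps, by_maps = {}, {}
    for c in (2, 3, 4):
        for delta in (3, 4):
            s = c + delta
            rg_maps[(c, s)] = across_scale_diff(R_pyr[c] - G_pyr[c],
                                                R_pyr[s] - G_pyr[s])
            by_maps[(c, s)] = across_scale_diff(B_pyr[c] - Y_pyr[c],
                                                B_pyr[s] - Y_pyr[s])
    return rg_maps, by_maps
```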

Local orientation information is obtained from $I$ using oriented Gabor pyramids $O(\sigma,\theta)$, where $\sigma \in [0..8]$ represents the scale and $\theta \in \{0\degree, 45\degree, 90\degree, 135\degree\}$ is the preferred orientation [11]. (Gabor filters, which are the product of a cosine grating and a 2D Gaussian envelope, approximate the receptive field sensitivity profile (impulse response) of orientation-selective neurons in primary visual cortex [12].) Orientation feature maps, $O(c,s,\theta)$, encode, as a group, local orientation contrast between the center and surround scales:

局部方向信息由 $I$ 通过带方向的 Gabor 金字塔 $O(\sigma,\theta)$ 获得,其中 $\sigma \in [0..8]$ 表示尺度,$\theta \in \{0\degree, 45\degree, 90\degree, 135\degree\}$ 为偏好方向[11]。(Gabor 滤波器是余弦光栅与二维高斯包络的乘积,近似初级视觉皮层中方向选择性神经元的感受野敏感度分布(脉冲响应)[12]。)方向特征图 $O(c,s,\theta)$ 作为一组,编码中心尺度与周围尺度之间的局部方向对比(公式4):
$$O(c,s,\theta)=\left|O(c,\theta)\ominus O(s,\theta)\right| \tag{4}$$
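
公式(4)的一个简化示意如下。注意:论文使用的是文献[11]的 Gabor 金字塔,这里只是用 `cv2.getGaborKernel` 在亮度金字塔的每一层上做卷积来近似 $O(\sigma,\theta)$,核大小与滤波器参数均为本文假设,不代表原实现:

```python
import cv2
import numpy as np

def orientation_feature_maps(I_pyr, thetas=(0, 45, 90, 135)):
    """公式(4)的近似:先对每层做 Gabor 滤波得到 O(σ, θ),
    再做跨尺度差得到 O(c, s, θ),共 6 × 4 = 24 张。"""
    gabor_pyrs = {}
    for theta in thetas:
        kernel = cv2.getGaborKernel(ksize=(9, 9), sigma=2.0,
                                    theta=np.deg2rad(theta),
                                    lambd=5.0, gamma=1.0, psi=0.0)
        gabor_pyrs[theta] = [np.abs(cv2.filter2D(level, cv2.CV_32F, kernel))
                             for level in I_pyr]

    maps = {}
    for theta in thetas:
        for c in (2, 3, 4):
            for delta in (3, 4):
                s = c + delta
                maps[(c, s, theta)] = across_scale_diff(gabor_pyrs[theta][c],
                                                        gabor_pyrs[theta][s])
    return maps
```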

In total, 42 feature maps are computed: Six for intensity, 12 for color and 24 for orientation.

总共需要计算42个特征图:亮度6张(3个中心尺度 × 2个尺度差)、颜色12张(红/绿与蓝/黄各6张)、方向24张(4个方向各6张)。

B. The Saliency Map 把特征图组合为显著图

The purpose of the saliency map is to represent the conspicuity - or "saliency" - at every location in the visual field by a scalar quantity, and to guide the selection of attended locations, based on the spatial distribution of saliency. A combination of the feature maps provides bottom-up input to the saliency map, modeled as a dynamical neural network.

显著图的目的是用一个标量来表示视野中每个位置的显眼程度(即"显著性"),并基于显著性的空间分布来引导被关注位置的选择。特征图的组合为显著图提供自下而上的输入;显著图本身被建模为一个动态神经网络。

One difficulty in combining different feature maps is that they represent a priori not comparable modalities, with different dynamic ranges and extraction mechanisms. Also, because all 42 feature maps are combined, salient objects appearing strongly in only a few maps may be masked by noise or less salient objects present in a larger number of maps.

组合不同特征图的一个困难在于,它们代表的是先验上无法直接比较的模态,具有不同的动态范围和提取机制。此外,由于要合并全部42张特征图,只在少数几张图中强烈出现的显著目标,可能会被大量其他图中的噪声或显著性较低的目标所掩盖。

In the absence of top-down supervision, we propose a map normalization operator, $\mathcal{N}(\cdot)$, which globally promotes maps in which a small number of strong peaks of activity (conspicuous locations) is present, while globally suppressing maps which contain numerous comparable peak responses. $\mathcal{N}(\cdot)$ consists of (Fig. 2): 1) Normalizing the values in the map to a fixed range $[0..M]$, in order to eliminate modality-dependent amplitude differences; 2) finding the location of the map's global maximum $M$ and computing the average $\bar m$ of all its other local maxima; 3) globally multiplying the map by $(M - \bar m)^2$.

在缺乏自上而下监督的情况下,我们提出一种特征图归一化算子 $\mathcal{N}(\cdot)$:在全局上增强只含少量强激活峰(显眼位置)的特征图,同时在全局上抑制含有大量大小相近峰值响应的特征图。$\mathcal{N}(\cdot)$ 包含以下步骤(图2):1)将图中的值归一化到固定范围 $[0..M]$,以消除依赖于模态的幅值差异;2)找到图的全局最大值 $M$ 的位置,并计算其余所有局部极大值的平均值 $\bar m$;3)将整张图乘以 $(M-\bar m)^2$。

图2 归一化操作

Only local maxima of activity are considered such that $\mathcal{N}(\cdot)$ compares responses associated with meaningful "activation spots" in the map and ignores homogeneous areas. Comparing the maximum activity in the entire map to the average over all activation spots measures how different the most active location is from the average. When this difference is large, the most active location stands out, and we strongly promote the map. When the difference is small, the map contains nothing unique and is suppressed. The biological motivation behind the design of $\mathcal{N}(\cdot)$ is that it coarsely replicates cortical lateral inhibition mechanisms, in which neighboring similar features inhibit each other via specific, anatomically-defined connections [15].

只考虑活动的局部极大值,使得 $\mathcal{N}(\cdot)$ 只比较图中有意义的"激活点"所对应的响应,而忽略均匀区域。把整张图中的最大活动值与所有激活点的平均值相比较,可以衡量最活跃位置与平均水平的差异:差异大时,最活跃的位置十分突出,我们就大力增强这张图;差异小时,这张图不含任何独特内容,便被抑制。$\mathcal{N}(\cdot)$ 设计背后的生物学动机是,它粗略复现了皮层的侧向抑制机制,即相邻的相似特征通过特定的、解剖学上确定的连接相互抑制[15]。
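
归一化算子 $\mathcal{N}(\cdot)$ 的一个示意实现如下(局部极大值用滑动窗口极大值来判定,窗口大小 `neighborhood` 与固定范围 $M=1$ 均为本文假设,论文未给出这些实现细节):

```python
import numpy as np
from scipy.ndimage import maximum_filter

def normalize_map(fmap, M=1.0, neighborhood=7):
    """归一化算子 N(·) 的示意实现:
    1) 把取值缩放到 [0, M];2) 找全局最大值 M 与其余局部极大值的均值 m_bar;
    3) 整图乘以 (M - m_bar)^2。"""
    fmap = fmap - fmap.min()
    if fmap.max() > 0:
        fmap = fmap / fmap.max() * M                      # 步骤 1
    # 步骤 2:滑动窗口极大值找局部极大值点(忽略全零的均匀区域)
    local_max = (fmap == maximum_filter(fmap, size=neighborhood)) & (fmap > 0)
    peaks = fmap[local_max]
    global_max = fmap.max()
    others = peaks[peaks < global_max]
    m_bar = others.mean() if others.size > 0 else 0.0
    return fmap * (M - m_bar) ** 2                        # 步骤 3
```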

Feature maps are combined into three "conspicuity maps", $\bar{\mathcal{I}}$ for intensity (Eq. 5), $\bar{\mathcal{C}}$ for color (Eq. 6), and $\bar{\mathcal{O}}$ for orientation (Eq. 7), at the scale ($\sigma = 4$) of the saliency map. They are obtained through across-scale addition, $\oplus$, which consists of reduction of each map to scale 4 and point-by-point addition:

特征图在显著图的尺度($\sigma = 4$)上组合为三张"醒目图(conspicuity maps)":$\bar{\mathcal{I}}$ 代表亮度(公式5),$\bar{\mathcal{C}}$ 代表颜色(公式6),$\bar{\mathcal{O}}$ 代表方向(公式7)。它们通过跨尺度加法 $\oplus$ 获得:先将每张特征图缩放到尺度4,再逐点相加:

$$\bar{\mathcal{I}}=\oplus_{c=2}^{4}\oplus_{s=c+3}^{c+4}\mathcal{N}\left(\mathcal{I}(c,s)\right) \tag{5}$$

$$\bar{\mathcal{C}}=\oplus_{c=2}^{4}\oplus_{s=c+3}^{c+4}\left[\mathcal{N}\left(\mathcal{RG}(c,s)\right)+\mathcal{N}\left(\mathcal{BY}(c,s)\right)\right] \tag{6}$$

For orientation, four intermediary maps are first created by combination of the six feature maps for a given θ \theta θ, and are then combined into a single orientation conspicuity map:

对于方向,先对每个给定的 $\theta$ 将6张特征图组合成一张中间图(共4张中间图),再把它们组合成单独一张方向醒目图:

$$\bar{\mathcal{O}}=\sum_{\theta \in \{0\degree,45\degree,90\degree,135\degree\}}\mathcal{N}\left(\oplus_{c=2}^{4}\oplus_{s=c+3}^{c+4}\mathcal{N}\left(O(c,s,\theta)\right)\right) \tag{7}$$

The motivation for the creation of three separate channels, $\bar{\mathcal{I}}$, $\bar{\mathcal{C}}$ and $\bar{\mathcal{O}}$, and their individual normalization is the hypothesis that similar features compete strongly for saliency, while different modalities contribute independently to the saliency map. The three conspicuity maps are normalized and summed into the final input $\mathcal{S}$ to the saliency map:

创建三个独立通道 $\bar{\mathcal{I}}$、$\bar{\mathcal{C}}$、$\bar{\mathcal{O}}$ 并分别对其归一化的动机,是假设相似特征之间会强烈竞争显著性,而不同模态则各自独立地贡献于显著图。三张醒目图经归一化后求和,得到显著图的最终输入 $\mathcal{S}$(公式8):
$$\mathcal{S}=\frac{1}{3}\left(\mathcal{N}(\bar{\mathcal{I}})+\mathcal{N}(\bar{\mathcal{C}})+\mathcal{N}(\bar{\mathcal{O}})\right) \tag{8}$$
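
跨尺度加法 $\oplus$ 以及公式(5)-(8)的组合流程可以写成如下示意代码(沿用前文自拟的 `normalize_map` 等函数;仅用于说明数据流,并非论文原实现):

```python
import cv2
import numpy as np

def across_scale_add(maps_dict, pyr, target_scale=4):
    """跨尺度加法 ⊕:把每张特征图缩放到尺度 4 的大小后逐点相加。
    pyr 提供尺度 4 的目标尺寸;maps_dict 形如 {(c, s): 2D 数组}。"""
    h, w = pyr[target_scale].shape[:2]
    acc = np.zeros((h, w), dtype=np.float32)
    for m in maps_dict.values():
        acc += cv2.resize(m, (w, h), interpolation=cv2.INTER_LINEAR)
    return acc

def saliency_input(I_maps, RG_maps, BY_maps, O_maps, I_pyr):
    """公式(5)-(8):先分别归一化并跨尺度相加得到三张醒目图,
    再归一化求平均得到显著图输入 S。"""
    I_bar = across_scale_add({k: normalize_map(v) for k, v in I_maps.items()}, I_pyr)
    C_bar = across_scale_add({k: normalize_map(RG_maps[k]) + normalize_map(BY_maps[k])
                              for k in RG_maps}, I_pyr)
    thetas = sorted({k[2] for k in O_maps})
    O_bar = np.zeros_like(I_bar)
    for th in thetas:                       # 公式(7):每个方向先 ⊕ 再 N(·),最后求和
        per_theta = {k: normalize_map(v) for k, v in O_maps.items() if k[2] == th}
        O_bar += normalize_map(across_scale_add(per_theta, I_pyr))
    return (normalize_map(I_bar) + normalize_map(C_bar) + normalize_map(O_bar)) / 3.0
```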

At any given time, the maximum of the saliency map (SM) defines the most salient image location, to which the focus of attention (FOA) should be directed. We could now simply select the most active location as defining the point where the model should next attend to. However, in a neuronally-plausible implementation, we model the SM as a 2D layer of leaky integrate-and-fire neurons at scale 4. These model neurons consist of a single capacitance which integrates the charge delivered by synaptic input, of a leakage conductance, and of a voltage threshold. When threshold is reached, a prototypical spike is generated, and the capacitive charge is shunted to zero [14]. The SM feeds into a biologically-plausible 2D "winner-take-all" (WTA) neural network [4], [1] at scale $\sigma = 4$, in which synaptic interactions among units ensure that only the most active location remains, while all other locations are suppressed.

在任意给定时刻,显著图(SM)的最大值定义了最显著的图像位置,即注意焦点(FOA)应指向的位置。我们本可以简单地把最活跃的位置选为模型下一个关注点。但在一种更符合神经机制的实现中,我们把SM建模为尺度4上的一层二维泄漏积分-发放(leaky integrate-and-fire)神经元。每个模型神经元由一个积分突触输入电荷的电容、一个泄漏电导和一个电压阈值组成;达到阈值时产生一个典型的脉冲,电容上的电荷被分流清零[14]。SM馈入一个生物学上可行的二维"赢者通吃"(WTA)神经网络[4],[1](尺度 $\sigma = 4$),其中各单元之间的突触相互作用确保只有最活跃的位置得以保留,而其他所有位置均被抑制。

The neurons in the SM receive excitatory inputs from $\mathcal{S}$ and are all independent. The potential of SM neurons at more salient locations hence increases faster (these neurons are used as pure integrators and do not fire). Each SM neuron excites its corresponding WTA neuron. All WTA neurons also evolve independently of each other, until one (the "winner") first reaches threshold and fires. This triggers three simultaneous mechanisms (Fig. 3): 1) The FOA is shifted to the location of the winner neuron; 2) the global inhibition of the WTA is triggered and completely inhibits (resets) all WTA neurons; 3) local inhibition is transiently activated in the SM, in an area with the size and new location of the FOA; this not only yields dynamical shifts of the FOA, by allowing the next most salient location to subsequently become the winner, but it also prevents the FOA from immediately returning to a previously attended location. Such an "inhibition of return" has been demonstrated in human visual psychophysics [16]. In order to slightly bias the model to subsequently jump to salient locations spatially close to the currently attended location, a small excitation is transiently activated in the SM, in a near surround of the FOA ("proximity preference" rule of Koch and Ullman [4]).

SM中的神经元从 $\mathcal{S}$ 接收兴奋性输入,且彼此独立。因此,位于更显著位置的SM神经元电位上升更快(这些神经元只作纯积分器使用,不发放)。每个SM神经元都会激励其对应的WTA神经元。所有WTA神经元同样彼此独立地演化,直到其中一个("赢者")率先达到阈值并发放。这会同时触发三个机制(图3):1)FOA移动到赢者神经元所在的位置;2)触发WTA的全局抑制,完全抑制(复位)所有WTA神经元;3)在SM中,以FOA的新位置为中心、以FOA的大小为范围,瞬时激活局部抑制;这不仅允许下一个最显著的位置随后成为赢者、从而产生FOA的动态转移,还能防止FOA立即返回先前关注过的位置。这种"返回抑制"已在人类视觉心理物理学中得到证明[16]。为了让模型略微倾向于随后跳向空间上靠近当前关注位置的显著位置,还会在SM中FOA的近邻区域瞬时激活一个小的兴奋(Koch和Ullman的"邻近偏好"规则[4])。
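
下面用一个极简的"取最大值 + 返回抑制"循环来示意 SM/WTA 的选点行为。注意:真实模型是泄漏积分-发放神经元的连续时间动力学(时间常数等参数见[17]),这里只保留其产生的关注顺序;FOA 半径按下文"宽高较小者的六分之一"设定:

```python
import numpy as np

def attention_scan(S, num_fixations=5, foa_radius=None):
    """简化的注意扫描:反复取显著图最大值(赢者通吃),
    并对已关注的圆盘区域做返回抑制(示意实现)。"""
    S = S.astype(np.float32).copy()
    h, w = S.shape
    if foa_radius is None:
        foa_radius = min(h, w) // 6          # FOA 半径:宽高较小者的 1/6
    yy, xx = np.mgrid[0:h, 0:w]
    fixations = []
    for _ in range(num_fixations):
        y, x = np.unravel_index(np.argmax(S), S.shape)   # 最活跃位置胜出
        fixations.append((int(y), int(x)))
        inhibit = (yy - y) ** 2 + (xx - x) ** 2 <= foa_radius ** 2
        S[inhibit] = 0.0                     # 返回抑制:暂时压制已关注区域
    return fixations
```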

图3 模型处理自然图像的操作示例。并行特征提取产生三张醒目图:颜色 $\bar{\mathcal{C}}$、亮度 $\bar{\mathcal{I}}$ 和方向 $\bar{\mathcal{O}}$。它们组合后形成显著图(SM)的输入 $\mathcal{S}$。最显著的位置是橙色电话亭,它在 $\bar{\mathcal{C}}$ 中表现强烈,成为第一个被关注的位置(模拟时间92毫秒)。在返回抑制反馈于显著图中抑制该位置之后,依次选出下一个最显著的位置。

Since we do not model any top-down attentional component, the FOA is a simple disk whose radius is fixed to one sixth of the smaller of the input image width or height. The time constants, conductances, and firing thresholds of the simulated neurons were chosen (see ref. [17] for details) so that the FOA jumps from one salient location to the next in approximately 30-70 ms (simulated time), and that an attended area is inhibited for approximately 500-900 ms (Fig. 3), as has been observed psychophysically [16]. The difference in the relative magnitude of these delays proved sufficient to ensure thorough scanning of the image, and prevented cycling through only a limited number of locations. All parameters are fixed in our implementation [17], and the system proved stable in time for all images studied.

由于我们没有对任何自上而下的注意成分建模,FOA是一个简单的圆盘,其半径固定为输入图像宽和高中较小者的六分之一。模拟神经元的时间常数、电导和发放阈值经过选择(详见参考文献[17]),使FOA在大约30~70 ms(模拟时间)内从一个显著位置跳到下一个显著位置,并且被关注过的区域会被抑制大约500~900 ms(图3),这与心理物理学的观察一致[16]。这两类时延在相对大小上的差异足以保证对图像的彻底扫描,并避免只在有限几个位置之间循环。在我们的实现中所有参数都是固定的[17],并且对所研究的全部图像,系统随时间都保持稳定。

C. Comparison with spatial frequency content models 与空间频率内容模型的比较

Reinagel and Zador [18] recently used an eye-tracking device to analyze the local spatial frequency distributions along eye scan paths generated by humans while free-viewing grayscale images. They found the spatial frequency content at the fixated locations to be significantly higher than, on average, at random locations. Although eye trajectories can differ from attentional trajectories under volitional control [1], visual attention is often thought of as a pre-oculomotor mechanism, strongly influencing free-viewing. It was hence interesting to investigate whether our model would reproduce the findings of Reinagel and Zador.

Reinagel和Zador[18]最近使用眼动追踪设备,分析了人在自由观看灰度图像时沿眼动扫描路径的局部空间频率分布。他们发现,注视位置处的空间频率含量在平均意义上显著高于随机位置。尽管在意志控制下眼动轨迹可能与注意轨迹不同[1],但视觉注意通常被认为是一种眼动之前的机制,对自由观看有很强的影响。因此,考察我们的模型能否重现Reinagel和Zador的发现是一件很有意思的事。

We constructed a simple measure of spatial frequency content (SFC): At a given image location, a $16 \times 16$ image patch is extracted from each of the $I(2)$, $R(2)$, $G(2)$, $B(2)$ and $Y(2)$ maps, and 2D Fast Fourier Transforms (FFTs) are applied to the patches. For each patch, a threshold is applied to compute the number of non-negligible FFT coefficients; the threshold corresponds to the FFT amplitude of a just perceivable grating (1% contrast). The SFC measure is the average of the numbers of non-negligible coefficients in the five corresponding patches. The size and scale of the patches were chosen such that the SFC measure is sensitive to approximately the same frequency and resolution ranges as our model; also, our SFC measure is computed in the RGB channels as well as in intensity, like the model. Using this measure, an SFC map is created at scale 4 for comparison with the saliency map (Fig. 4).

图4 示例:彩色图像(a)、相应的显著图输入(b)、空间频率内容(SFC)图(c),以及显著图输入高于其最大值98%的位置(d,黄色圆圈)和SFC高于其最大值98%的图像块(d,红色方块)。显著图对噪声非常鲁棒,而SFC则不然。

我们构造了一个简单的空间频率含量(SFC)度量:在给定的图像位置,从 $I(2)$、$R(2)$、$G(2)$、$B(2)$、$Y(2)$ 每张图中各提取一个 $16\times 16$ 的图像块,并对这些图像块做二维快速傅里叶变换(FFT)。对每个图像块,用一个阈值统计不可忽略的FFT系数个数;该阈值对应于刚好可感知的光栅(1%对比度)的FFT幅值。SFC度量是五个对应图像块中不可忽略系数个数的平均值。图像块的大小和尺度经过选择,使SFC度量敏感的频率与分辨率范围与我们的模型大致相同;同样,与模型一样,我们的SFC度量既在RGB通道中计算,也在亮度中计算。利用该度量,在尺度4上生成一张SFC图,用于与显著图比较(图4)。
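
SFC 度量的一个示意实现如下。"刚好可感知光栅(1%对比度)"对应的 FFT 幅值阈值在论文中没有给出具体公式,这里用图像块 DC 分量(像素和)的 1% 作为粗略近似,属本文假设;函数名亦为自拟:

```python
import numpy as np

def sfc_at(patch16, contrast_threshold=0.01):
    """对一个 16x16 图像块统计"不可忽略的 FFT 系数个数"。
    阈值取 DC 分量(块内像素和)的 1%,粗略近似 1% 对比度光栅的幅值(本文假设)。"""
    F = np.fft.fft2(patch16)
    amp = np.abs(F)
    thresh = contrast_threshold * (np.abs(patch16).sum() + 1e-12)
    return int((amp > thresh).sum())

def sfc_measure(maps_at_scale2, y, x, half=8):
    """在给定位置 (y, x),对 I(2)、R(2)、G(2)、B(2)、Y(2) 各取一个 16x16 块,
    返回五个块上不可忽略系数个数的平均值(假定 (y, x) 距图像边界至少 8 像素)。"""
    counts = []
    for m in maps_at_scale2:                 # 形如 [I2, R2, G2, B2, Y2]
        patch = m[y - half:y + half, x - half:x + half]
        counts.append(sfc_at(patch))
    return float(np.mean(counts))
```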

III. 结果和讨论

Although the concept of a saliency map has been widely used in focus-of-attention models [1], [3], [7], little detail is usually provided about its construction and dynamics. Here we examine how the feedforward feature extraction stages, the map combination strategy, and the temporal properties of the saliency map all contribute to the overall system performance.

尽管显著图的概念已在注意焦点模型中被广泛使用[1],[3],[7],但关于其构建和动力学的细节通常很少给出。在这里,我们考察前馈特征提取阶段、特征图组合策略以及显著图的时间特性分别如何影响系统的整体性能。

A. General performance 一般性能

The model was extensively tested with artificial images to ensure proper functioning. For example, several objects of same shape but varying contrast with the background were attended to in order of decreasing contrast. The model proved very robust to the addition of noise to such images (Fig. 5), particularly if the properties of the noise (e.g., its color) were not directly conflicting with the main feature of the target.

图5 噪声对检测性能的影响。以 $768\times 512$ 的场景为例,其中目标(两个人)因其独特的颜色对比而显著。图中给出在找到目标之前错误检测次数的平均值 $\pm$ 标准误(对50次噪声采样)随噪声密度变化的函数。系统对不直接干扰目标主要特征的噪声非常鲁棒(左:亮度噪声与彩色目标)。当噪声具有与目标相似的特性时,它会削弱目标的显著性,系统会先去关注在其他特征(此处为亮度的粗尺度变化)上显著的对象。

该模型用人工合成图像进行了广泛测试,以确保其正常运行。例如,几个形状相同但与背景对比度不同的物体,会按对比度递减的顺序依次被关注。事实证明,向这类图像添加噪声时模型非常鲁棒(图5),尤其是当噪声的属性(例如其颜色)不直接与目标的主要特征冲突时。

The model was able to reproduce human performance for a number of pop-out tasks [7], using images of the type shown in Fig. 2. When a target differed from an array of surrounding distractors by its unique orientation (like in Fig. 2), color, intensity or size, it was always the first attended location, irrespectively of the number of distractors. Contrarily, when the target differed from the distractors only by a conjunction of features (e.g., it was the only red-horizontal bar in a mixed array of red-vertical and green-horizontal bars), the search time necessary to find the target increased linearly with the number of distractors. Both results have been widely observed in humans [7], and are discussed in Section III-B.

使用图2所示类型的图像,该模型能够重现人类在许多"弹出(pop-out)"任务[7]中的表现。当目标凭借独特的方向(如图2)、颜色、亮度或大小区别于周围的一组干扰物时,无论干扰物有多少,它总是第一个被关注的位置。相反,当目标仅凭特征的组合区别于干扰物时(例如,它是红色竖条和绿色横条混合阵列中唯一的红色横条),找到目标所需的搜索时间随干扰物数量线性增加。这两种结果都在人类实验中被广泛观察到[7],并将在第III-B节讨论。

We also tested the model with real images, ranging from natural outdoor scenes to artistic paintings, and using $\mathcal{N}(\cdot)$ to normalize the feature maps (Fig. 3 and ref. [17]). With many such images, it is difficult to objectively evaluate the model, because no objective reference is available for comparison, and observers may disagree on which locations are the most salient. However, in all images studied, most attended locations were objects of interest, such as faces, flags, persons, buildings or vehicles.

我们还用真实图像测试了模型,范围涵盖自然户外场景到艺术绘画,并使用 $\mathcal{N}(\cdot)$ 对特征图进行归一化(图3和参考文献[17])。对于许多这类图像,很难客观地评估模型,因为没有可供比较的客观参照,而且观察者对哪些位置最显著也可能意见不一。不过,在所研究的全部图像中,大多数被关注的位置都是感兴趣的对象,例如人脸、旗帜、人物、建筑物或车辆。

Model predictions were compared to the measure of local SFC, in an experiment similar to that of Reinagel and Zador [18], using natural scenes with salient traffic signs (90 images), a red soda can (104 images), or a vehicle's emergency triangle (64 images). Similar to Reinagel and Zador's findings, the SFC at attended locations was significantly higher than the average SFC, by a factor decreasing from $2.5 \pm 0.05$ at the first attended location to $1.6 \pm 0.05$ at the 8th attended location. Although this result does not necessarily indicate similarity between human eye fixations and the model's attentional trajectories, it indicates that the model, like humans, is attracted to "informative" image locations, according to the common assumption that regions with richer spectral content are more informative. The SFC map was similar to the saliency map for most images (e.g., Fig. 4.1). However, both maps differed substantially for images with strong, extended variations of illumination or color (e.g., due to speckle noise): While such areas exhibited uniformly high SFC, they had low saliency because of their uniformity (Figs. 4.2, 4.3). In such images, the saliency map was usually in better agreement with our subjective perception of saliency. Quantitatively, for the 258 images studied here, the SFC at attended locations was significantly lower than the maximum SFC, by a factor decreasing from $0.90 \pm 0.02$ at the first attended location to $0.55 \pm 0.05$ at the 8th attended location. While the model was attending to locations with high SFC, these were not necessarily the locations with highest SFC. It consequently seems that saliency is more than just a measure of local SFC. The model, which implements within-feature spatial competition, captured subjective saliency better than the purely local SFC measure.

在一个与Reinagel和Zador[18]类似的实验中,我们将模型预测与局部SFC度量进行了比较,使用的自然场景包含显著的交通标志(90张图像)、红色汽水罐(104张图像)或车载紧急三角牌(64张图像)。与Reinagel和Zador的发现类似,被关注位置的SFC显著高于平均SFC,其倍数从第一个被关注位置的 $2.5\pm 0.05$ 逐渐下降到第八个被关注位置的 $1.6\pm 0.05$。尽管这一结果并不必然说明人眼注视与模型注意轨迹之间的相似性,但它表明模型和人类一样,会被"信息丰富"的图像位置吸引,这基于一个常见假设:频谱内容越丰富的区域信息量越大。对大多数图像而言,SFC图与显著图相似(例如图4.1)。
然而,对于光照或颜色存在强烈、大范围变化的图像(例如由散斑噪声引起),两种图差异很大:这些区域虽然SFC一致偏高,但正因为其均匀性而显著性很低(图4.2、4.3)。在这类图像中,显著图通常与我们对显著性的主观感知更一致。定量地看,对这里研究的258张图像,被关注位置的SFC显著低于最大SFC,其比值从第一个被关注位置的 $0.90\pm 0.02$ 下降到第八个被关注位置的 $0.55\pm 0.05$。也就是说,虽然模型关注的是SFC较高的位置,但这些位置并不一定是SFC最高的位置。因此,显著性似乎不仅仅是局部SFC的度量。实现了特征内空间竞争的模型,比纯局部的SFC度量更好地刻画了主观显著性。

B. Strengths and limitations 优势和局限性

We have proposed a model whose architecture and components mimic the properties of primate early vision. Despite its simple architecture and feedforward feature extraction mechanisms, the model is capable of strong performance with complex natural scenes. For example, it quickly detected salient traffic signs of varied shapes (round, triangular, square, rectangular), colors (red, blue, white, orange, black) and textures (letter markings, arrows, stripes, circles), although it had not been designed for this purpose. Such strong performance reinforces the idea that a unique saliency map, receiving input from early visual processes, could effectively guide bottom-up attention in primates [4], [10], [5], [8]. From a computational viewpoint, the major strength of this approach lies in the massively parallel implementation, not only of the computationally expensive early feature extraction stages, but also of the attention focusing system. More than previous models based extensively on relaxation techniques [5], our architecture could easily allow for real-time operation on dedicated hardware.

我们提出了一个体系结构和组成部分都模仿灵长类动物早期视觉特性的模型。尽管体系结构简单、特征提取机制是前馈的,该模型在复杂自然场景中仍表现出很强的性能。例如,它能快速检测出形状(圆形、三角形、正方形、矩形)、颜色(红、蓝、白、橙、黑)和纹理(字母标记、箭头、条纹、圆圈)各异的显著交通标志,尽管它并不是为此目的设计的。如此强的性能支持了这样一种观点:一张接收早期视觉过程输入的统一显著图,可以有效引导灵长类动物自下而上的注意[4],[10],[5],[8]。从计算角度看,这种方法的主要优势在于大规模并行实现,不仅包括计算量很大的早期特征提取阶段,也包括注意聚焦系统。与以前大量依赖松弛技术的模型[5]相比,我们的体系结构更容易在专用硬件上实现实时运行。

The type of performance which can be expected from this model critically depends on one factor: Only object features explicitly represented in at least one of the feature maps can lead to pop-out, that is, rapid detection independently of the number of distracting objects [7]. Without modifying the preattentive feature extraction stages, our model cannot detect conjunctions of features. While our system immediately detects a target which differs from surrounding distractors by its unique size, intensity, color or orientation (properties which we have implemented because they have been very well characterized in primary visual cortex), it will fail at detecting targets salient for unimplemented feature types (e.g., T junctions or line terminators, for which the existence of specific neural detectors remains controversial). For simplicity, we also have not implemented any recurrent mechanism within the feature maps, and hence cannot reproduce phenomena like contour completion and closure, important for certain types of human pop-out [19]. In addition, at present our model does not include any magnocellular motion channel, known to play a strong role in human saliency [5].

该模型所能达到的性能类型关键取决于一个因素:只有在至少一张特征图中被显式表示的物体特征才能产生"弹出"效应,即检测速度与干扰物数量无关的快速检测[7]。在不修改前注意特征提取阶段的情况下,我们的模型无法检测特征的组合。虽然我们的系统能立即检测出凭独特的大小、亮度、颜色或方向区别于周围干扰物的目标(我们实现这些属性,是因为它们在初级视觉皮层中已有非常充分的刻画),但对于凭未实现的特征类型而显著的目标(例如T形交点或线段端点,其专门神经检测器是否存在仍有争议),它将检测失败。为简单起见,我们也没有在特征图内部实现任何循环机制,因此无法重现轮廓补全与闭合等现象,而这些现象对某些类型的人类弹出效应很重要[19]。此外,目前我们的模型不包含任何大细胞运动通道,而该通道已知在人类显著性中起重要作用[5]。

A critical model component is the normalization $\mathcal{N}(\cdot)$, which provided a general mechanism for computing saliency in any situation. The resulting saliency measure implemented by the model, although often related to local SFC, was closer to human saliency because it implemented spatial competition between salient locations. Our feed-forward implementation of $\mathcal{N}(\cdot)$ is faster and simpler than previously proposed iterative schemes [5]. Neuronally, spatial competition effects similar to $\mathcal{N}(\cdot)$ have been observed in the non-classical receptive field of cells in striate and extrastriate cortex [15].

模型的一个关键组成部分是归一化算子 $\mathcal{N}(\cdot)$,它为在任何情况下计算显著性提供了通用机制。模型由此实现的显著性度量虽然常与局部SFC相关,但由于在显著位置之间引入了空间竞争,因而更接近人类显著性。我们对 $\mathcal{N}(\cdot)$ 的前馈实现比以前提出的迭代方案[5]更快、更简单。在神经层面,类似于 $\mathcal{N}(\cdot)$ 的空间竞争效应已在纹状皮层和纹外皮层细胞的非经典感受野中被观察到[15]。

In conclusion, we have presented a conceptually simple computational model for saliency-driven focal visual attention. The biological insight guiding its architecture proved efficient in reproducing some of the performances of primate visual systems. The efficiency of this approach for target detection critically depends on the features types implemented. The framework presented here can consequently be easily tailored to arbitrary tasks through the implementation of dedicated feature maps.

总之,我们提出了一个概念上简单的、由显著性驱动的焦点视觉注意计算模型。指导其体系结构的生物学洞见被证明能够有效复现灵长类视觉系统的某些性能。这种方法在目标检测上的效率关键取决于所实现的特征类型。因此,通过实现专门的特征图,可以很容易地把本文提出的框架定制到任意任务上。


笔记 需要用来写论文,暂时不公开。
