The histogram loss [1] is used for embedding learning. The idea is to minimize the probability that the similarity of a negative pair exceeds that of a positive pair, where similarity is the inner product of the embeddings, $s_{ij}=\langle x_i,x_j\rangle$ (the embeddings are L2-normalized, so $s_{ij}\in[-1,1]$).

This probability is the paper's Eq. (3):
$$p_{\mathrm{reverse}}=\int_{-1}^{1}p^-(x)\left[\int_{-1}^{x}p^+(y)\,dy\right]dx\tag{a}$$
where $p^-(\cdot)$ is the distribution of the negative-pair similarities $s^-$ and $p^+(\cdot)$ is that of the positive-pair similarities $s^+$; both need to be estimated.
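Equivalently, (a) is just the probability that an independently drawn negative similarity is at least as large as a positive one, $P(s^-\geq s^+)$, which makes it easy to sanity-check by Monte Carlo. A minimal sketch in plain NumPy; the Gaussian similarity samples are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical similarity samples, clipped to [-1, 1]:
# positives concentrated near 0.7, negatives near 0.1
s_pos = np.clip(rng.normal(0.7, 0.2, 2000), -1, 1)
s_neg = np.clip(rng.normal(0.1, 0.3, 2000), -1, 1)

# p_reverse = P(s^- >= s^+), estimated over all cross pairs of draws
p_reverse = (s_neg[:, None] >= s_pos[None, :]).mean()
print(p_reverse)  # small when the two distributions barely overlap
```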
The density-estimation idea comes from [3]; the notation below mostly follows [1], and the +/- superscripts are dropped because both densities are estimated the same way. Start from the limit definition of the density:
$$p(t_r)=\lim_{\Delta\rightarrow0}\frac{\Phi(t_{r+1})-\Phi(t_{r-1})}{2\Delta}$$
The paper places $R$ uniformly spaced nodes on $[-1,1]$, with bin width $\Delta=\frac{2}{R-1}$; $t_r$ is one of those nodes, $r\in\{1,\dots,R\}$, and $\Phi(\cdot)$ is the CDF corresponding to $p(\cdot)$. Estimating this discretely:
$$\Phi(t_{r+1})-\Phi(t_{r-1})\approx\phi_{r+1}-\phi_{r-1}=\sum_{s\in[t_{r-1},t_{r+1}]}\frac{1}{|S|}$$
where $\phi_r$ is the discrete (empirical) CDF and $|S|$ is the number of similarity samples. Hence
$$p(t_r)\approx\frac{h_r}{\Delta}=\frac{\#\{s\in[t_{r-1},t_{r+1}]\}}{2|S|}\cdot\frac{1}{\Delta}$$
This can be read as a histogram: for the $r$-th bar, $\Delta$ is the bin width and the numerator counts the samples falling into the neighborhood of $t_r$. The first factor is the probability mass of the interval (the 2 in the denominator is there because a similarity $s_{ij}\in[t_x,t_{x+1}]$ gets counted once when estimating $p(t_x)$ and once again for $p(t_{x+1})$, i.e., twice overall); dividing by the bin width turns mass into density. A direct sketch of this counting estimate follows.
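A plain-NumPy sketch of the counting estimate, assuming the paper's node grid (the edge nodes simply get a one-sided neighborhood):

```python
import numpy as np

def density_at_nodes(s, R=151):
    """Counting estimate: p(t_r) ~= #{s in [t_{r-1}, t_{r+1}]} / (2 |S| delta)."""
    t = np.linspace(-1.0, 1.0, R)  # nodes t_1, ..., t_R
    delta = 2.0 / (R - 1)          # bin width
    near = np.abs(s[None, :] - t[:, None]) <= delta  # [R, |S|] membership
    return near.sum(axis=1) / (2 * len(s)) / delta   # [R] density estimates
```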
As [3] shows, this estimate can be rewritten in kernel form:
$$h_r=\frac{1}{|S|}\sum_{(i,j)}\frac{1}{2}\cdot\mathbb{1}\left(\frac{|s_{ij}-t_r|}{\Delta}\leq1\right)\tag{b}$$
The kernel used here is the box kernel $K_{\Delta}(x-t_r)=\frac{1}{2}\cdot\mathbb{1}\left(\frac{|x-t_r|}{\Delta}\leq1\right)$, so a natural extension is to use a different kernel. The paper picks the triangular kernel [8], with which its Eq. (2) reads:
$$\delta_{i,j,r}=\left(1-\frac{|s_{ij}-t_r|}{\Delta}\right)\cdot\mathbb{1}\left(\frac{|s_{ij}-t_r|}{\Delta}\leq1\right)\tag{c}$$
and its Eq. (1) becomes:
$$h_r=\frac{1}{|S|}\sum_{(i,j)}\left(1-\frac{|s_{ij}-t_r|}{\Delta}\right)\cdot\mathbb{1}\left(\frac{|s_{ij}-t_r|}{\Delta}\leq1\right)\tag{d}$$
Compared with (b), this kernel makes $h_r$ a piecewise-linear function of $s_{ij}$ rather than a step function, so gradients can be back-propagated through it; a vectorized sketch is given below.
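A vectorized NumPy sketch of (d); `np.maximum` realizes the indicator and the triangular weight in one step, exactly as `delta_ijr` does in the TensorFlow code further below:

```python
import numpy as np

def soft_histogram(s, R=151):
    """h_r of eq. (d): triangular-kernel histogram of similarities s in [-1, 1]."""
    t = np.linspace(-1.0, 1.0, R)
    delta = 2.0 / (R - 1)
    w = 1.0 - np.abs(s[None, :] - t[:, None]) / delta  # [R, |S|]
    return np.maximum(w, 0.0).sum(axis=1) / len(s)     # piecewise-linear in each s_ij
```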
One can verify that $\{h_r\}$ is a valid probability distribution: clearly $0\leq h_r\leq1$, and the bins sum to one. For each pair $(i,j)$, write $t_k\leq s_{ij}\leq t_{k+1}$ for the two nodes bracketing $s_{ij}$; these are the only two bins the pair contributes to, so
$$\begin{aligned}\sum_{r=1}^{R}h_r&=\frac{1}{|S|}\sum_{(i,j)}\left[\frac{s_{ij}-t_k}{\Delta}+\frac{t_{k+1}-s_{ij}}{\Delta}\right]\\&=\sum_{(i,j)}\frac{1}{|S|}=1\end{aligned}$$
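Both properties are easy to confirm numerically with the `soft_histogram` sketch above:

```python
s = np.random.default_rng(0).uniform(-1.0, 1.0, 1000)
h = soft_histogram(s)
assert np.all((0 <= h) & (h <= 1))
assert np.isclose(h.sum(), 1.0)  # each s_ij contributes exactly 1 across its two bins
```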
The final estimate of (a) is then the paper's Eq. (4), the histogram loss:
$$L=\sum_{r=1}^{R}\left(h_r^-\sum_{q=1}^{r}h_q^+\right)$$
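A TensorFlow implementation of the whole pipeline follows; the inner sum $\sum_{q=1}^{r}h_q^+$ is realized with an all-ones lower-triangular mask, i.e., an explicit cumulative sum (`tf.cumsum` would work just as well).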
```python
import tensorflow as tf

def cos(X, Y=None):
    """C[i][j] = cos(X_i, Y_j)"""
    X_n = tf.math.l2_normalize(X, axis=1)
    if (Y is None) or (X is Y):
        return tf.matmul(X_n, tf.transpose(X_n))
    Y_n = tf.math.l2_normalize(Y, axis=1)
    return tf.matmul(X_n, tf.transpose(Y_n))

def sim_mat(label, label2=None):
    """S[i][j] = 1 <=> the i-th & j-th samples share at least 1 label"""
    if label2 is None:
        label2 = label
    return tf.cast(tf.matmul(label, tf.transpose(label2)) > 0, "float32")

def histogram_loss(X, L, R=151):
    """histogram loss
    X: [n, d], features; L2 normalization NOT required (cos() normalizes)
    L: [n, c], multi-hot labels
    R: scalar, number of histogram nodes, same as the paper
    """
    delta = 2. / (R - 1)  # bin width
    # t = (t_1, ..., t_R)
    t = tf.linspace(-1., 1., R)[:, None]  # [R, 1]
    # ground-truth similarity matrix
    M = sim_mat(L)  # [n, n]
    # cosine similarity, in [-1, 1]
    S = cos(X)  # [n, n]
    # get indices of the upper triangular part (without the diagonal)
    S_hat = S + 2  # shift values to [1, 3] to ensure every triu entry > 0
    S_triu = tf.linalg.band_part(S_hat, 0, -1) * (1 - tf.eye(tf.shape(S)[0]))
    triu_id = tf.where(S_triu > 0)
    # extract triu -> vector of n(n - 1) / 2 pair similarities
    S = tf.gather_nd(S, triu_id)[None, :]  # [1, n(n-1)/2]
    M_pos = tf.gather_nd(M, triu_id)[None, :]
    M_neg = 1 - M_pos
    scaled_abs_diff = tf.math.abs(S - t) / delta  # [R, n(n-1)/2]
    # eq. (c); equivalent masked form:
    #   mask_near = tf.cast(scaled_abs_diff <= 1, "float32")
    #   delta_ijr = (1 - scaled_abs_diff) * mask_near
    delta_ijr = tf.maximum(0., 1. - scaled_abs_diff)
    def histogram(mask):
        """h = (h_1, ..., h_R): eq. (d) restricted to the masked pairs"""
        sum_delta = tf.reduce_sum(delta_ijr * mask, 1)  # [R]
        return sum_delta / tf.maximum(1., tf.reduce_sum(mask))  # guard /0
    h_pos = histogram(M_pos)[None, :]  # [1, R]
    h_neg = histogram(M_neg)  # [R]
    # all-ones lower triangular (with diagonal): row r sums h^+_1..h^+_r
    mask_cdf = tf.linalg.band_part(tf.ones([R, R]), -1, 0)
    cdf_pos = tf.reduce_sum(mask_cdf * h_pos, 1)  # [R], cumulative sum of h_pos
    loss = tf.reduce_sum(h_neg * cdf_pos)  # eq. (4)
    return loss
```
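A hypothetical usage sketch, assuming TF 2.x eager execution and single-label data encoded one-hot (the batch size, feature dimension, and class count here are made up):

```python
import tensorflow as tf

n, d, c = 32, 128, 10
feat = tf.random.normal([n, d])                                 # raw, un-normalized features
labels = tf.one_hot(tf.random.uniform([n], 0, c, tf.int32), c)  # [n, c] one-hot
print(histogram_loss(feat, labels).numpy())                     # scalar loss
```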