Histogram Loss Notes

Histogram loss [1] is used for embedding learning. The idea is to minimize the probability that a negative pair's similarity is larger than a positive pair's, where similarity is the inner product of the embeddings, $s_{ij}=\langle x_i,x_j\rangle$. The similarities are split into $s^+$ and $s^-$ according to positive/negative pairs, corresponding to the sets $S^+$ and $S^-$ in the paper. With the embeddings L2-normalized, the similarities lie in $[-1,1]$.
That probability is Equation (3) of the paper:
$$p_{reverse}=\int_{-1}^{1}p^-(x)\left[\int_{-1}^{x}p^+(y)\,dy\right]dx \tag{a}$$
where $p^-(\cdot)$ is the distribution of $s^-$ and $p^+(\cdot)$ is the distribution of $s^+$; both need to be estimated.
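Intuitively, (a) is the probability that a randomly drawn negative-pair similarity exceeds a randomly drawn positive-pair similarity. A minimal NumPy sketch of a brute-force estimate, with made-up sample arrays `s_pos` and `s_neg` purely for illustration:

import numpy as np

# hypothetical similarity samples for positive / negative pairs, both in [-1, 1]
s_pos = np.random.uniform(-1, 1, size=1000)  # samples of s^+
s_neg = np.random.uniform(-1, 1, size=1000)  # samples of s^-

# brute-force estimate of p_reverse = P(s^+ < s^-):
# compare every positive sample against every negative sample
p_reverse = np.mean(s_pos[None, :] < s_neg[:, None])
print(p_reverse)  # ~0.5 here, since both toy distributions are identical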

Density Estimation

The density-estimation idea follows [3]; the notation below mostly follows [1], with the +/- superscripts dropped because both densities are estimated the same way. Start from the limit definition of the density:
$$p(t_r)=\lim_{\Delta\rightarrow0}\frac{\Phi(t_{r+1})-\Phi(t_{r-1})}{2\Delta}$$
The paper places $R$ points uniformly in $[-1,1]$, so the bin width is $\Delta=\frac{2}{R-1}$; $t_r$ is one of those points, with $r\in\{1,\dots,R\}$, and $\Phi(\cdot)$ is the CDF corresponding to $p(\cdot)$. To estimate (a) discretely, first approximate the CDF difference:
$$\Phi(t_{r+1})-\Phi(t_{r-1})\approx\phi_{r+1}-\phi_{r-1}=\sum_{s\in[t_{r-1},t_{r+1}]}\frac{1}{|S|}$$
where $\phi_r$ is the empirical (discrete) CDF, so that
$$p(t_r)\approx\frac{h_r}{\Delta}=\frac{\#\{s\in[t_{r-1},t_{r+1}]\}}{2|S|}\cdot\frac{1}{\Delta}$$
This can be read as a histogram: for the $r$-th bar, $\Delta$ is the bin width, the numerator counts the points falling in the neighborhood of $t_r$, and the first factor is the probability mass of that interval (the 2 in the denominator appears because any $s_{ij}\in[t_x,t_{x+1}]$ is counted twice, once when computing $p(t_x)$ and once when computing $p(t_{x+1})$); dividing by the bin width gives the density.
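As a sanity check of this counting estimate, here is a minimal NumPy sketch; the sample array `s` and the value of `R` are arbitrary choices for illustration:

import numpy as np

R = 151                      # number of nodes, as in the paper
delta = 2. / (R - 1)         # bin width
t = np.linspace(-1., 1., R)  # nodes t_1, ..., t_R

s = np.random.uniform(-1, 1, size=5000)  # toy similarity samples

# h_r: half the fraction of samples within one bin width of t_r (rectangular window)
h = np.array([np.sum(np.abs(s - tr) <= delta) for tr in t]) / (2. * len(s))
p = h / delta                # density estimate at each node t_r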
As shown in [3], this estimate can be rewritten in kernel form:
$$h_r=\frac{1}{|S|}\sum_{(i,j)}\frac{1}{2}\cdot\mathbb{1}\!\left(\frac{|s_{ij}-t_r|}{\Delta}\leq1\right)\tag{b}$$
where the kernel is $K_{\Delta}(x-t_r)=\frac{1}{2}\cdot\mathbb{1}\!\left(\frac{|x-t_r|}{\Delta}\leq1\right)$. A natural extension is therefore to plug in a different kernel. The paper uses a triangular kernel [8], so its Equation (2) can be rewritten as
$$\delta_{i,j,r}=\left(1-\frac{|s_{ij}-t_r|}{\Delta}\right)\cdot\mathbb{1}\!\left(\frac{|s_{ij}-t_r|}{\Delta}\leq1\right)\tag{c}$$
and its Equation (1) becomes
$$h_r=\frac{1}{|S|}\sum_{(i,j)}\left(1-\frac{|s_{ij}-t_r|}{\Delta}\right)\cdot\mathbb{1}\!\left(\frac{|s_{ij}-t_r|}{\Delta}\leq1\right)\tag{d}$$
Compared with (b), this kernel makes $h_r$ depend on the values $s_{ij}$ rather than only on a hard count, so gradients can be back-propagated through it.
One can verify that $\{h_r\}$ is a valid probability distribution. Clearly $0\leq h_r\leq1$. To compute $\sum_{r=1}^Rh_r$, note that any $s_{ij}\in[t_k,t_{k+1})$ contributes $\frac{t_{k+1}-s_{ij}}{\Delta}$ to $h_k$ and $\frac{s_{ij}-t_k}{\Delta}$ to $h_{k+1}$ (and nothing to the other bins), so
$$\begin{aligned}\sum_{r=1}^Rh_r&=\frac{1}{|S|}\sum_{(i,j)}\left[\frac{s_{ij}-t_k}{\Delta}+\frac{t_{k+1}-s_{ij}}{\Delta}\right]\\&=\sum_{(i,j)}\frac{1}{|S|}=1\end{aligned}$$
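A minimal NumPy sketch of (d) on toy samples, just to check numerically that the triangular-kernel histogram sums to 1 (array names are illustrative only):

import numpy as np

R = 151
delta = 2. / (R - 1)
t = np.linspace(-1., 1., R)[:, None]              # [R, 1] nodes
s = np.random.uniform(-1, 1, size=5000)[None, :]  # [1, |S|] toy similarities

scaled = np.abs(s - t) / delta                    # [R, |S|]
delta_ijr = np.maximum(0., 1. - scaled)           # triangular kernel, Eq. (c)
h = delta_ijr.mean(axis=1)                        # h_r = (1/|S|) * sum over pairs, Eq. (d)

print(h.sum())  # ~1.0 up to floating-point error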

Histogram Loss

The final discrete estimate of (a) is Equation (4) of the paper, i.e. the histogram loss:
$$L=\sum_{r=1}^R\left(h_r^-\sum_{q=1}^rh_q^+\right)$$
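In other words, $L$ is the inner product of the negative histogram with the cumulative sum of the positive histogram, which is exactly how the code below implements it. A short NumPy sketch, using made-up histograms for illustration:

import numpy as np

# toy histograms for illustration; in practice both come from Eq. (d)
h_pos = np.random.dirichlet(np.ones(151))  # positive-pair histogram, sums to 1
h_neg = np.random.dirichlet(np.ones(151))  # negative-pair histogram, sums to 1

# Eq. (4): inner product of h^- with the cumulative sum of h^+
loss = np.sum(h_neg * np.cumsum(h_pos))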

Code

  • tensorflow 1.12
import tensorflow as tf
def cos(X, Y=None):
	"""C(i,j) = cos(Xi, Yj)"""
    X_n = tf.math.l2_normalize(X, axis=1)
    if (Y is None) or (X is Y):
        return tf.matmul(X_n, tf.transpose(X_n))
    Y_n = tf.math.l2_normalize(Y, axis=1)
    return tf.matmul(X_n, tf.transpose(Y_n))


def sim_mat(label, label2=None):
    """S[i][j] = 1 <=> i- & j-th share at lease 1 label"""
    if label2 is None:
        label2 = label
    return tf.cast(tf.matmul(label, tf.transpose(label2)) > 0, "float32")


def histogram_loss(X, L, R=151):
    """histogram loss
    X: [n, d], feature WITHOUT L2 norm
    L: [n, c], label
    R: scalar, number of histogram nodes (estimation points), same as in the paper
    """
    delta = 2. / (R - 1)  # step
    # t = (t_1, ..., t_R)
    t = tf.lin_space(-1., 1., R)[:, None]  # [R, 1]
    # ground-truth similarity matrix
    M = sim_mat(L)  # [n, n]
    # cosine similarity, in [-1, 1]
    S = cos(X)  # [n, n]

    # get indices of upper triangular (without diag)
    S_hat = S + 2  # shift value to [1, 3] to ensure triu > 0
    S_triu = tf.linalg.band_part(S_hat, 0, -1) * (1 - tf.eye(tf.shape(S)[0]))
    triu_id = tf.where(S_triu > 0)

    # extract triu -> vector of [n(n - 1) / 2]
    S = tf.gather_nd(S, triu_id)[None, :]  # [1, n(n-1)/2]
    M_pos = tf.gather_nd(M, triu_id)[None, :]
    M_neg = 1 - M_pos

    scaled_abs_diff = tf.math.abs(S - t) / delta  # [R, n(n-1)/2]
    # mask_near = tf.cast(scaled_abs_diff <= 1, "float32")
    # delta_ijr = (1 - scaled_abs_diff) * mask_near
    delta_ijr = tf.maximum(0., 1. - scaled_abs_diff)  # triangular kernel, Eq. (c)

    def histogram(mask):
        """h = (h_1, ..., h_R)"""
        sum_delta = tf.reduce_sum(delta_ijr * mask, 1)  # [R]
        return sum_delta / tf.maximum(1., tf.reduce_sum(mask))

    h_pos = histogram(M_pos)[None, :]  # [1, R]
    h_neg = histogram(M_neg)  # [R]
    # all 1 in lower triangular (with diag)
    mask_cdf = tf.linalg.band_part(tf.ones([R, R]), -1, 0)
    cdf_pos = tf.reduce_sum(mask_cdf * h_pos, 1)  # [R]

    loss = tf.reduce_sum(h_neg * cdf_pos)
    return loss
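A minimal usage sketch for TensorFlow 1.12, assuming the `cos`, `sim_mat`, and `histogram_loss` functions above are in scope; the placeholder names, shapes, and random toy data are made up for illustration:

import numpy as np
import tensorflow as tf

n, d, c = 32, 128, 10
X_ph = tf.placeholder(tf.float32, [None, d])  # features, not L2-normalized
L_ph = tf.placeholder(tf.float32, [None, c])  # one-/multi-hot labels
loss = histogram_loss(X_ph, L_ph, R=151)

with tf.Session() as sess:
    feat = np.random.randn(n, d).astype(np.float32)
    lab = np.eye(c, dtype=np.float32)[np.random.randint(0, c, size=n)]  # random one-hot labels
    print(sess.run(loss, {X_ph: feat, L_ph: lab}))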

References

  1. (paper) Learning Deep Embeddings with Histogram Loss
  2. (code) madkn/HistogramLoss
  3. 什么是核密度估计?如何感性认识?
  4. 2.8. 概率密度估计(Density Estimation)
  5. 核密度估计
  6. 非参数方法——核密度估计(Kernel Density Estimation)
  7. 核密度估计Kernel Density Estimation(KDE)概述 密度估计的问题
  8. Kernel (statistics)
  9. 核函数
  10. 《Learning Deep Embeddings with Histogram Loss》笔记
