TF-IDF stands for Term Frequency–Inverse Document Frequency: TF is the term frequency and IDF is the inverse document frequency.
Blog references:
文本挖掘预处理之TF-IDF (TF-IDF in text-mining preprocessing)
余弦相似性 (cosine similarity), Wikipedia
Definitions
The dataset under evaluation has size $N$.
Candidates: $C = \{c_{1}, c_{2}, \ldots, c_{N}\}$.
References: $S_{i} = \{s_{i1}, s_{i2}, \ldots, s_{iM}\}$, where $M$ is the number of reference sentences and $i$ denotes the $i$-th image.
TF-IDF
For a candidate sentence $c_{i}$, the TF-IDF weight of each of its n-grams is computed as follows.

$$g_{k}(c_{i}) = TF(k) \cdot IDF(k)$$

$$TF(k) = \frac{h_{k}(c_{i})}{\sum_{l} h_{l}(c_{i})}$$

$$IDF(k) = \log\left(\frac{N}{\sum_{p=1}^{N} \min\left(1, \sum_{q=1}^{M} h_{k}(s_{pq})\right)}\right)$$

where:
$g_{k}(c_{i})$ is the TF-IDF weight of the n-gram $\omega_{k}$.
$h_{k}(c_{i})$ is the number of times the n-gram $\omega_{k}$ occurs in the sentence $c_{i}$.
$\sum_{l} h_{l}(c_{i})$ is the sum, over every n-gram $\omega_{l}$ in the dataset, of its number of occurrences in $c_{i}$, i.e. the total number of n-grams in $c_{i}$.
$\sum_{p=1}^{N} \min\left(1, \sum_{q=1}^{M} h_{k}(s_{pq})\right)$ is the number of images whose reference sentences contain the n-gram $\omega_{k}$ at least once. Each image counts at most once, so this is a document frequency rather than a total occurrence count. For an n-gram that occurs in the dataset, its minimum is 1 and its maximum is $N$.
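To make the definitions above concrete, here is a minimal Python sketch of the weights. The function names (`ngrams`, `document_frequency`, `tfidf`) and the data layout (a list of tokenized reference sentences per image) are my own choices for illustration, not the API of any library. I assume the document frequency is counted over the reference sentences of each image, and that an n-gram absent from every reference set is given a document frequency of 1 so the logarithm stays defined.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def document_frequency(references, n):
    """For each n-gram, count the images whose reference set contains it.

    `references` has length N; element p is the list of tokenized
    reference sentences of image p.
    """
    df = Counter()
    for refs in references:
        seen = set()
        for sentence in refs:
            seen.update(ngrams(sentence, n))
        df.update(seen)  # each image counts at most once: min(1, ...)
    return df


def tfidf(tokens, n, df, num_images):
    """Return {n-gram: g_k} for one tokenized sentence."""
    counts = Counter(ngrams(tokens, n))
    total = sum(counts.values())  # sum over l of h_l
    weights = {}
    for gram, h in counts.items():
        # Assumption: unseen n-grams get document frequency 1.
        idf = math.log(num_images / max(1, df[gram]))
        weights[gram] = (h / total) * idf
    return weights


# Toy data: two images, with two and one reference sentences.
references = [
    [["a", "cat", "sits"], ["a", "cat"]],
    [["a", "dog", "runs"]],
]
df = document_frequency(references, 1)
weights = tfidf(["a", "cat", "runs"], 1, df, num_images=len(references))
```

In this toy data "a" appears in the references of both images, so its IDF is $\log(2/2) = 0$ and its weight vanishes, while "cat" and "runs" each appear for one image only and keep a positive weight.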
CIDEr
$n = 1, 2, 3, 4$ is the $n$ of the n-gram, i.e. 1-gram, 2-gram, 3-gram and 4-gram.

$$CIDEr_{n}(c_{i}, S_{i}) = \frac{1}{M} \sum_{j=1}^{M} \frac{g^{n}(c_{i}) \cdot g^{n}(s_{ij})}{\|g^{n}(c_{i})\| \, \|g^{n}(s_{ij})\|}$$

where $g^{n}(c_{i})$ is the vector of TF-IDF weights of all n-grams $\omega_{k}$ of length $n$ in the sentence $c_{i}$. Each term of the sum is therefore the cosine similarity between the candidate and one reference sentence.
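The formula for $CIDEr_{n}$ can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the weight vectors are stored as sparse dicts keyed by n-gram, the names `cosine` and `cider_n` are hypothetical, and an all-zero vector is scored as 0 to avoid dividing by zero.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: an all-zero vector scores 0
    return dot / (norm_u * norm_v)


def cider_n(candidate_vec, reference_vecs):
    """Average cosine similarity between a candidate and its M references."""
    total = sum(cosine(candidate_vec, ref) for ref in reference_vecs)
    return total / len(reference_vecs)


# One reference points the same way as the candidate, the other is orthogonal.
score = cider_n({"a": 1.0}, [{"a": 2.0}, {"b": 1.0}])
```

Missing keys are treated as zero weights, so the two vectors do not need to share the same vocabulary.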
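Example: the candidate is $c_{1}$ = {我 吃 饭 了 吗} ("Have I eaten?") and the reference is $s_{11}$ = {他 早 上 吃 饭 了} ("He ate in the morning"). Take 1-grams, with dataset size $N = 1$ and $M = 1$ reference sentence per image. With $IDF(k) = 1$ (see the note after the calculation), each weight reduces to the term frequency: $1/5 = 0.2$ for every 1-gram of the candidate and $1/6 \approx 0.167$, written as 0.16 here, for every 1-gram of the reference.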
n-gram $\omega_{k}$ | $g_{k}(c_{1})$ | $g_{k}(s_{11})$ |
---|---|---|
吃 | 0.2 | 0.16 |
饭 | 0.2 | 0.16 |
了 | 0.2 | 0.16 |
吗 | 0.2 | 0 |
我 | 0.2 | 0 |
他 | 0 | 0.16 |
早 | 0 | 0.16 |
上 | 0 | 0.16 |
For this example:

$$CIDEr_{1} = \frac{[0.2, 0.2, 0.2, 0.2, 0.2, 0, 0, 0] \cdot [0.16, 0.16, 0.16, 0, 0, 0.16, 0.16, 0.16]}{\sqrt{0.2^2+0.2^2+0.2^2+0.2^2+0.2^2+0^2+0^2+0^2} \cdot \sqrt{0.16^2+0.16^2+0.16^2+0^2+0^2+0.16^2+0.16^2+0.16^2}} = 0.5477$$
Note: because $N = 1$, the IDF of every n-gram is theoretically $IDF(k) = 0$. To avoid this, $IDF(k) = 1$ is used here.
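As a quick check of the arithmetic, the example can be recomputed from the raw tokens. This is a small sketch rather than the code from my repository: `tf_vector` is a helper name made up for illustration, IDF is fixed to 1 as in the note above, and the exact value $1/6$ is used instead of the truncated 0.16. Cosine similarity does not depend on the length of either vector, so the result is the same.

```python
import math
from collections import Counter

candidate = ["我", "吃", "饭", "了", "吗"]
reference = ["他", "早", "上", "吃", "饭", "了"]


def tf_vector(tokens):
    """1-gram TF weights; IDF is fixed to 1 for this example."""
    counts = Counter(tokens)
    return {tok: c / len(tokens) for tok, c in counts.items()}


g_c = tf_vector(candidate)  # every entry is 1/5
g_s = tf_vector(reference)  # every entry is 1/6

# Only the shared 1-grams (吃, 饭, 了) contribute to the dot product.
dot = sum(w * g_s.get(tok, 0.0) for tok, w in g_c.items())
norm_c = math.sqrt(sum(w * w for w in g_c.values()))
norm_s = math.sqrt(sum(w * w for w in g_s.values()))
cider_1 = dot / (norm_c * norm_s)

print(round(cider_1, 4))  # 0.5477
```

In closed form the score is $3/\sqrt{5 \cdot 6} = 3/\sqrt{30} \approx 0.5477$: three shared 1-grams, divided by the square roots of the two sentence lengths.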
The code is on my GitHub in the CaptionMetrics repository; it follows the metrics released with COCO.