M. J. Kusner, Y. Sun, N. I. Kolkin, K. Q. Weinberger, From Word Embeddings To Document Distances, ICML (2015)
Word embedding: learning semantically meaningful representations for words from their local co-occurrences in sentences.
Word Mover's Distance (WMD): a distance function between text documents built on word embeddings. WMD measures the dissimilarity of two documents as the minimum amount of distance that the embedded words of one document need to "travel" to reach the embedded words of the other.
The WMD metric has no hyperparameters.
The two most common document representations are bag-of-words (BOW) and TF-IDF vectors.
Because the BOW/TF-IDF vectors of distinct documents are frequently near-orthogonal, neither is well suited to measuring document distance; moreover, neither captures the distance between individual words.
Latent low-dimensional representations of documents:
Semantic relationships are often preserved in vector operations on word vectors, i.e., distances between embedded word vectors are to some degree semantically meaningful. This paper represents a text document as a weighted point cloud of embedded words, and defines the Word Mover's Distance (WMD) between documents $A$ and $B$ as the minimum cumulative distance that the words from document $A$ need to travel to match the point cloud of document $B$ (Fig. 1).
The WMD optimization problem is a special case of the Earth Mover's Distance (EMD) transportation problem. The paper derives several cheap lower bounds that serve as approximations to WMD or prune away documents that are provably not among the $k$ nearest neighbors of a query.
Properties of WMD: (1) hyperparameter-free; (2) highly interpretable: the distance between two documents can be broken down and explained as the sparse distances between a few individual words; (3) high retrieval accuracy.
Okapi BM25
LDA
LSI
TextTiling-EMD
Stacked Denoising Autoencoders (SDA)、mSDA
Componential Counting Grid
word2vec: a word-embedding procedure that learns a vector representation for each word using a (shallow) neural network language model.
The skip-gram model consists of an input layer, a projection layer, and an output layer, and is trained to predict nearby words. The word vectors are learned by maximizing the log probability of neighboring words in the corpus, i.e., given a sequence of words $w_1, \cdots, w_T$:
$$\frac{1}{T} \sum_{t=1}^{T} \sum_{j \in nb(t)} \log p(w_j \mid w_t)$$
where $nb(t)$ is the set of neighboring words of word $t$, and $p(w_j \mid w_t)$ is the hierarchical softmax of the associated word vectors $\mathbf{v}_{w_j}$ and $\mathbf{v}_{w_t}$. Due to its surprisingly simple architecture and the use of the hierarchical softmax, the skip-gram model can be trained on billions of words per hour on a conventional desktop computer, which allows it to learn complex word relationships.
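As a concrete illustration, skip-gram vectors can be trained with the gensim library. This is only a minimal sketch: the paper itself relies on Google's pre-trained word2vec model, and the toy corpus and parameter values below are illustrative assumptions.

```python
# Minimal sketch: training skip-gram word vectors with gensim.
# The paper uses the pre-trained Google News word2vec model; this
# two-sentence toy corpus is only for illustration.
from gensim.models import Word2Vec

corpus = [
    ["obama", "speaks", "to", "the", "media", "in", "illinois"],
    ["the", "president", "greets", "the", "press", "in", "chicago"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=50,  # embedding dimension d
    window=5,        # size of the neighborhood nb(t)
    sg=1,            # 1 = skip-gram (0 = CBOW)
    hs=1,            # use the hierarchical softmax for p(w_j | w_t)
    min_count=1,
)
vec = model.wv["president"]  # d-dimensional word vector
```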
Let $\mathbf{X} \in \mathbb{R}^{d \times n}$ be a word2vec embedding matrix for a vocabulary of $n$ words, whose $i$-th column $\mathbf{x}_i \in \mathbb{R}^d$ is the embedding of the $i$-th word in $d$-dimensional space. A text document is represented as a normalized bag-of-words (nBOW) vector $\mathbf{d} \in \mathbb{R}^n$: if word $i$ appears $c_i$ times in the document, then $d_i = \frac{c_i}{\sum_{j=1}^{n} c_j}$. An nBOW vector $\mathbf{d}$ is typically very sparse.
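A minimal sketch of building nBOW vectors, assuming a shared word-to-index vocabulary; the `nbow` helper and the toy vocabulary are illustrative, not from the paper.

```python
# Sketch: normalized bag-of-words (nBOW) vectors d in R^n over a
# shared vocabulary, with d_i = c_i / sum_j c_j.
from collections import Counter
import numpy as np

def nbow(tokens, vocab):
    """Return the nBOW vector of a token list over vocab (word -> index dict)."""
    counts = Counter(t for t in tokens if t in vocab)
    d = np.zeros(len(vocab))
    for word, c in counts.items():
        d[vocab[word]] = c
    return d / d.sum()

vocab = {"obama": 0, "speaks": 1, "media": 2, "illinois": 3,
         "president": 4, "greets": 5, "press": 6, "chicago": 7}
d1 = nbow(["obama", "speaks", "media", "illinois"], vocab)
d2 = nbow(["president", "greets", "press", "chicago"], vocab)
```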
nBOW representation
An nBOW vector $\mathbf{d}$ lies in the $(n-1)$-dimensional simplex. Two documents with different unique words lie in different regions of the simplex, yet the two documents may be semantically close.
Word travel cost
The paper incorporates the semantic similarity between individual word pairs into the document distance metric. Word dissimilarity is measured by Euclidean distance in the word2vec embedding space: the distance between words $i$ and $j$ is $c(i, j) = \| \mathbf{x}_i - \mathbf{x}_j \|_2$, the cost associated with "traveling" from one word to the other.
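A small sketch of this cost computation, assuming the embedding matrix $\mathbf{X}$ is available as a NumPy array; the `cost_matrix` helper and the random toy embeddings are illustrative.

```python
# Sketch: word travel cost c(i, j) = ||x_i - x_j||_2 between all
# pairs of embeddings, computed as a dense n x n matrix.
import numpy as np

def cost_matrix(X):
    """X: d x n embedding matrix; returns C with C[i, j] = c(i, j)."""
    diff = X[:, :, None] - X[:, None, :]     # d x n x n pairwise differences
    return np.sqrt((diff ** 2).sum(axis=0))  # Euclidean norms

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))  # toy embeddings for the 8 vocabulary words
C = cost_matrix(X)
```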
Document distance
(1) Let $\mathbf{d}$ and $\mathbf{d}'$ be the nBOW representations of two documents in the $(n-1)$-dimensional simplex.
(2) Each word $i$ in $\mathbf{d}$ is allowed to be transformed into any word in $\mathbf{d}'$, in total or in parts.
(3) Let $\mathbf{T} \in \mathbb{R}^{n \times n}$ be a (sparse) flow matrix, where $\mathbf{T}_{ij} \geq 0$ denotes how much of word $i$ in $\mathbf{d}$ travels to word $j$ in $\mathbf{d}'$.
(4) To transform $\mathbf{d}$ entirely into $\mathbf{d}'$, the entire outgoing flow from word $i$ must equal $d_i$, i.e., $\sum_j \mathbf{T}_{ij} = d_i$; likewise, the incoming flow to word $j$ must match $d'_j$, i.e., $\sum_i \mathbf{T}_{ij} = d'_j$.
The distance between the two documents is then defined as the minimum (weighted) cumulative cost required to move all words from $\mathbf{d}$ to $\mathbf{d}'$:
$$\sum_{i, j} \mathbf{T}_{ij} \, c(i, j)$$
Transportation problem
Given these constraints, the minimum cumulative cost of moving $\mathbf{d}$ to $\mathbf{d}'$ is the solution of the following linear program:
$$\begin{aligned} \min_{\mathbf{T} \geq 0} \quad & \sum_{i, j = 1}^{n} \mathbf{T}_{ij} \, c(i, j) \\ \text{subject to:} \quad & \sum_{j = 1}^{n} \mathbf{T}_{ij} = d_{i}, \quad \forall i \in \{ 1, \cdots, n \} \\ & \sum_{i = 1}^{n} \mathbf{T}_{ij} = d_{j}^{\prime}, \quad \forall j \in \{ 1, \cdots, n \} \end{aligned} \tag{1}$$
The Word Mover's Distance (WMD) is the solution of Eq. (1). Since $c(i, j)$ is a metric, WMD can be shown to be a metric as well.
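The LP can be handed to any generic solver. Below is a hedged sketch using `scipy.optimize.linprog`, reusing `d1`, `d2`, and `C` from the earlier sketches; specialized EMD solvers (e.g., the POT library) are much faster in practice, but the generic LP makes the constraints explicit.

```python
# Sketch: solving the WMD transportation LP of Eq. (1) with scipy.
import numpy as np
from scipy.optimize import linprog

def wmd(d1, d2, C):
    """d1, d2: nBOW vectors (length n); C: n x n word travel cost matrix."""
    n = len(d1)
    # Row-sum constraints: sum_j T_ij = d1_i  (outgoing flow)
    A_rows = np.kron(np.eye(n), np.ones(n))
    # Column-sum constraints: sum_i T_ij = d2_j  (incoming flow)
    A_cols = np.kron(np.ones(n), np.eye(n))
    A_eq = np.vstack([A_rows, A_cols])
    b_eq = np.concatenate([d1, d2])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.fun

print(wmd(d1, d2, C))  # WMD between the two toy documents
```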
Visualization
The best average time complexity of solving the WMD optimization problem is $\mathcal{O}(p^3 \log p)$, where $p$ is the number of unique words in the documents (i.e., $p$ is the length of the nBOW vectors).
Lower bounds for the WMD transportation problem:
Word centroid distance
By the triangle inequality, the centroid distance $\| \mathbf{X} \mathbf{d} - \mathbf{X} \mathbf{d}' \|_2$ between documents $\mathbf{d}$ and $\mathbf{d}'$ is a lower bound on their WMD distance:
$$\sum_{i, j = 1}^{n} \mathbf{T}_{ij} \, c(i, j) \geq \| \mathbf{X} \mathbf{d} - \mathbf{X} \mathbf{d}' \|_{2}$$
Proof (using $\mathbf{T}_{ij} \geq 0$ and the triangle inequality for the Euclidean norm):
$$\begin{aligned} \sum_{i, j = 1}^{n} \mathbf{T}_{ij} \, c(i, j) & = \sum_{i, j = 1}^{n} \mathbf{T}_{ij} \| \mathbf{x}_{i} - \mathbf{x}_{j}^{\prime} \|_{2} = \sum_{i, j = 1}^{n} \| \mathbf{T}_{ij} (\mathbf{x}_{i} - \mathbf{x}_{j}^{\prime}) \|_{2} \\ & \geq \Big\| \sum_{i, j = 1}^{n} \mathbf{T}_{ij} (\mathbf{x}_{i} - \mathbf{x}_{j}^{\prime}) \Big\|_{2} \\ & = \Big\| \sum_{i = 1}^{n} \Big( \sum_{j = 1}^{n} \mathbf{T}_{ij} \Big) \mathbf{x}_{i} - \sum_{j = 1}^{n} \Big( \sum_{i = 1}^{n} \mathbf{T}_{ij} \Big) \mathbf{x}_{j}^{\prime} \Big\|_{2} \\ & = \Big\| \sum_{i = 1}^{n} d_{i} \mathbf{x}_{i} - \sum_{j = 1}^{n} d_{j}^{\prime} \mathbf{x}_{j}^{\prime} \Big\|_{2} = \| \mathbf{X} \mathbf{d} - \mathbf{X} \mathbf{d}' \|_{2} \end{aligned}$$
Because each document is represented by its weighted average word vector, this bound is called the Word Centroid Distance (WCD). It is very fast to compute via a few matrix operations and scales as $\mathcal{O}(dp)$.
For nearest-neighbor queries, WCD can be used to prefetch promising candidates and thereby speed up the WMD search.
WCD is cheap to compute, but not very tight.
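A sketch of WCD, reusing the toy `X`, `d1`, `d2` from the earlier snippets; for a whole corpus it reduces to a single matrix product followed by row-wise norms.

```python
# Sketch: the Word Centroid Distance ||Xd - Xd'||_2, a cheap lower
# bound computed from the weighted mean word vectors of each document.
import numpy as np

def wcd(d1, d2, X):
    """Lower bound on WMD: distance between weighted average word vectors."""
    return np.linalg.norm(X @ d1 - X @ d2)

print(wcd(d1, d2, X))  # always <= wmd(d1, d2, C)
```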
Relaxed word moving distance
Much tighter lower bounds can be obtained by relaxing the WMD optimization problem, removing one of the two constraints at a time.
Removing the second constraint yields the optimization problem:
$$\begin{aligned} \min_{\mathbf{T} \geq 0} \quad & \sum_{i, j = 1}^{n} \mathbf{T}_{ij} \, c(i, j) \\ \text{subject to:} \quad & \sum_{j = 1}^{n} \mathbf{T}_{ij} = d_{i}, \quad \forall i \in \{ 1, \cdots, n \} \end{aligned}$$
Every WMD solution (satisfying both constraints) remains a feasible solution when one constraint is removed, so the feasible region of the relaxed problem is larger and its optimum must be a lower bound on the WMD distance.
An optimal flow matrix $\mathbf{T}^{\ast}$ of the relaxed problem is:
$$\mathbf{T}^{\ast}_{ij} = \begin{cases} d_{i}, & \text{if } j = \operatorname{argmin}_{j} c(i, j) \\ 0, & \text{otherwise} \end{cases} \tag{2}$$
To see this, let $\mathbf{T}$ be any feasible solution of the relaxed problem and, for each word $i$, let $j^{\ast} = \operatorname{argmin}_{j} c(i, j)$ be its nearest word in the other document; then
$$\sum_{j} \mathbf{T}_{ij} \, c(i, j) \geq \sum_{j} \mathbf{T}_{ij} \, c(i, j^{\ast}) = c(i, j^{\ast}) \sum_{j} \mathbf{T}_{ij} = c(i, j^{\ast}) \, d_{i} = \sum_{j} \mathbf{T}_{ij}^{\ast} \, c(i, j)$$
Hence $\mathbf{T}^{\ast}$ attains the minimum objective value. Computing this solution only requires identifying $j^{\ast} = \operatorname{argmin}_{j} c(i, j)$, which is a nearest-neighbor search in the Euclidean word2vec space: for each word vector $\mathbf{x}_i$ of document $D$, find the most similar word vector $\mathbf{x}_j$ in document $D'$.
Removing the first constraint instead reverses the direction of the nearest-neighbor search: for each word vector $\mathbf{x}_j$ of document $D'$, find the most similar word vector $\mathbf{x}_i$ in document $D$.
Let the two relaxed solutions be $l_1(\mathbf{d}, \mathbf{d}')$ and $l_2(\mathbf{d}, \mathbf{d}')$. Taking the maximum of the two yields an even tighter lower bound, called the Relaxed WMD (RWMD):
$$l_{r} (\mathbf{d}, \mathbf{d}^{\prime}) = \max \left( l_{1} (\mathbf{d}, \mathbf{d}^{\prime}),\ l_{2} (\mathbf{d}, \mathbf{d}^{\prime}) \right)$$
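A sketch of the two relaxed bounds and RWMD over the shared-vocabulary cost matrix `C` from above; the `relaxed_bound` helper is illustrative.

```python
# Sketch: the relaxed lower bounds l_1, l_2 and their maximum (RWMD).
# Each relaxation sends the full mass d_i of every source word i to
# its single nearest word in the target document.
import numpy as np

def relaxed_bound(d_from, d_to, C):
    """One-sided bound: each source word moves entirely to its cheapest target."""
    targets = d_to > 0                   # words present in the target document
    nearest = C[:, targets].min(axis=1)  # c(i, j*) for every source word i
    return float(d_from @ nearest)       # sum_i d_i * c(i, j*)

def rwmd(d1, d2, C):
    return max(relaxed_bound(d1, d2, C), relaxed_bound(d2, d1, C.T))

print(rwmd(d1, d2, C))  # tighter than WCD, still <= WMD
```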
Prefetch and prune
To find the $k$ nearest neighbors of a query document:
(1) Sort all documents by their WCD distance to the query document and compute the exact WMD to the first $k$ of them;
(2) Traverse the remaining documents: first check whether a document's RWMD lower bound exceeds the WMD distance of the current $k$-th closest document; if so, it can be pruned. Otherwise, compute its exact WMD distance and update the $k$ nearest neighbors accordingly.
Because the RWMD approximation is extremely tight, up to 95% of documents can be pruned on some datasets.
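A hedged sketch of the prefetch-and-prune search, reusing the `wmd`, `wcd`, and `rwmd` helpers from the earlier snippets. For simplicity it uses one shared-vocabulary cost matrix; an efficient implementation would build a small per-pair cost matrix over each document's unique words.

```python
# Sketch of prefetch-and-prune k-NN search: sort by cheap WCD, compute
# exact WMD for the top k, then use RWMD to skip documents that
# provably cannot enter the current k-nearest set.
import heapq
import numpy as np

def knn_wmd(query, docs, X, C, k):
    """query, docs: nBOW vectors; returns [(wmd, doc index)] of the k nearest."""
    order = np.argsort([wcd(query, d, X) for d in docs])  # prefetch by WCD
    heap = []                                             # max-heap via negated distances
    for idx in order[:k]:
        heapq.heappush(heap, (-wmd(query, docs[idx], C), idx))
    for idx in order[k:]:
        kth = -heap[0][0]                    # WMD of current k-th closest document
        if rwmd(query, docs[idx], C) >= kth:
            continue                         # pruned: lower bound already too large
        dist = wmd(query, docs[idx], C)
        if dist < kth:
            heapq.heapreplace(heap, (-dist, idx))
    return sorted((-negd, i) for negd, i in heap)
```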
Seven baseline document representations are compared: bag-of-words (BOW), TF-IDF (term frequency-inverse document frequency), Okapi BM25, LSI (Latent Semantic Indexing), LDA (Latent Dirichlet Allocation), mSDA (Marginalized Stacked Denoising Autoencoder), and CCG (Componential Counting Grid).
Classification is performed with $k$-nearest neighbors under the Euclidean distance on each representation; hyperparameters are tuned via Bayesian optimization.
The RWMD bound is extremely close to the exact WMD distance, whereas the WCD bound is considerably looser.