DSSM (Deep Semantic Similarity Model) is a modeling technique based on deep neural networks. It projects paired texts of different types (e.g., <query, document> pairs) into a common low-dimensional semantic space, where downstream machine learning tasks can then be carried out.
When the corpus is large, the vocabulary_size is also large, so the embedding_matrix becomes large as well, which hurts network training. The letter-n-gram-based Word Hashing method can be used to reduce the dimensionality of the network's input vectors.
letter n-gram:
Mainly used in English NLP.
Set a fixed-length window and slide it over the word letter by letter with stride = 1.
Given a word (e.g. good), we first add word starting and ending marks to the word (e.g. #good#). Then, we break the word into letter n-grams (e.g. letter trigrams: #go, goo, ood, od#).
Finally, the word is represented using a vector of letter n-grams.
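The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation; the function names and the tiny trigram index are made up for the example.

```python
def letter_ngrams(word, n=3):
    """Split a word into letter n-grams after adding boundary marks (#)."""
    marked = f"#{word}#"
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]

def word_hash_vector(word, trigram_index, n=3):
    """Represent a word as a bag-of-letter-n-grams count vector.

    trigram_index maps each known n-gram to a position in the vector.
    """
    vec = [0] * len(trigram_index)
    for gram in letter_ngrams(word, n):
        if gram in trigram_index:
            vec[trigram_index[gram]] += 1
    return vec

print(letter_ngrams("good"))  # ['#go', 'goo', 'ood', 'od#']
```

Because the number of distinct letter trigrams is far smaller than the vocabulary, the resulting vector is much lower-dimensional than a one-hot word vector.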
Dimensionality reduction effect
In the example from paper [1], vocabulary_size = 500K, while letter_tri_gram_size = 30K.
Figure 1-2: the dataset consists of pairs with relevance scales, from paper [1].
The figure's caption is reproduced below:
Illustration of the DSSM. It uses a DNN to map high-dimensional sparse text features into low-dimensional dense features in a semantic space. The first hidden layer, with 30k units, accomplishes word hashing. The word-hashed features are then projected through multiple layers of non-linear projections. The final layer's neural activities in this DNN form the feature in the semantic space.
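The mapping described in the caption can be sketched with a plain NumPy forward pass. The layer sizes below (30k word-hashed input, two hidden layers, 128-dim semantic output) are assumptions for illustration; the weights are random, so this only shows the shape of the computation, not a trained model.

```python
import numpy as np

# Assumed layer sizes: 30k word-hashed input -> 300 -> 300 -> 128 semantic dims.
rng = np.random.default_rng(0)
sizes = [30_000, 300, 300, 128]
weights = [rng.standard_normal((m, n)) * 0.01
           for m, n in zip(sizes[:-1], sizes[1:])]

def semantic_vector(x):
    """Project a sparse word-hashed vector through tanh layers."""
    h = x
    for W in weights:
        h = np.tanh(h @ W)
    return h

def cosine(a, b):
    """Cosine similarity between two semantic vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy trigram-count vectors for a query and a document (indices are arbitrary).
q = np.zeros(30_000); q[[10, 42, 99]] = 1.0
d = np.zeros(30_000); d[[10, 42, 500]] = 1.0
sim = cosine(semantic_vector(q), semantic_vector(d))
```

In the actual model, query and document vectors are scored by this cosine similarity in the shared semantic space, and the network weights are learned rather than random.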
See reference [2].
For details, see reference [3].