https://arxiv.org/pdf/2002.06652.pdf
I. INTRODUCTION
One limitation of BERT is that, due to the large model size, it is time-consuming to perform sentence-pair tasks such as clustering and semantic search.
One effective way to solve this problem is to transform a sentence into a vector that encodes the semantic meaning of the sentence.
Currently, a common sentence embedding approach for BERT-based models is to average the representations obtained from the last layer or to use the [CLS] token for sentence-level prediction.
The BERT model is too large, so sentences are encoded into vectors, using either [CLS] as the sentence-level representation or the average of the last-layer representations.
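The two pooling strategies mentioned above can be sketched on toy data. This is a minimal illustration, assuming the last-layer output is already available as a list of per-token vectors with [CLS] at position 0. The names `mean_pool` and `cls_pool` and the attention-mask argument are illustrative and not taken from the paper.

```python
from typing import List

Vector = List[float]


def mean_pool(token_vectors: List[Vector], attention_mask: List[int]) -> Vector:
    """Average the last-layer vectors of real tokens (mask == 1), skipping padding."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def cls_pool(token_vectors: List[Vector]) -> Vector:
    """Use the vector at position 0, where BERT places the [CLS] token."""
    return list(token_vectors[0])


# Toy last-layer output: [CLS], two word pieces, one padding position (dimension 3).
last_layer = [
    [0.5, 0.1, -0.2],  # [CLS]
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [9.0, 9.0, 9.0],   # padding, ignored by mean_pool
]
mask = [1, 1, 1, 0]

sentence_mean = mean_pool(last_layer, mask)
sentence_cls = cls_pool(last_layer)
```

Both functions return one fixed-size vector per sentence, which is what makes clustering and semantic search cheap once the vectors are computed.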
Different from SBERT, we investigate sentence embedding by studying the geometric structure of deep contextualized models and propose a new method by dissecting BERT-based word models.
The paper mainly studies the geometric structure of the model.
SBERT-WK inherits the strength of deep contextualized models, which are trained on both word- and sentence-level objectives. It is compatible with most deep contextualized models such as BERT [5] and RoBERTa [11].
It inherits both word-level and sentence-level strengths and is compatible with deep contextualized models such as BERT and RoBERTa.
II. RELATED WORK
Traditional word embedding methods provide a static representation for a word in a vocabulary set.
A static representation has two drawbacks. First, it cannot deal with polysemy. Second, it cannot adjust the meaning of a word based on its context.
These are the drawbacks of traditional static embeddings.
Sentence embedding methods can be categorized into two categories: non-parameterized and parameterized models.
Non-parameterized methods usually rely on high-quality pre-trained word embeddings. Following this line of averaging word embeddings, several weighted averaging methods were proposed, including tf-idf, SIF [21], uSIF [22] and GEM [23].
Parameterized models are more complex, and they usually perform better than non-parameterized models.
Sentence embedding methods come in two kinds of models.
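As an illustration of the weighted-averaging family, the sketch below applies the SIF-style weight a / (a + p(w)), where p(w) is the unigram probability of a word, so that frequent words contribute less. It is a partial sketch only: the common-component removal step of SIF is omitted, and the toy embeddings, probabilities, and the function name `sif_weighted_average` are assumptions made for the example.

```python
from typing import Dict, List


def sif_weighted_average(
    tokens: List[str],
    embeddings: Dict[str, List[float]],
    unigram_prob: Dict[str, float],
    a: float = 1e-3,
) -> List[float]:
    """Average word vectors with weight a / (a + p(w)); frequent words count less."""
    known = [tok for tok in tokens if tok in embeddings]
    if not known:
        raise ValueError("no token has an embedding")
    dim = len(embeddings[known[0]])
    total = [0.0] * dim
    for tok in known:
        weight = a / (a + unigram_prob.get(tok, 0.0))
        for i in range(dim):
            total[i] += weight * embeddings[tok][i]
    return [value / len(known) for value in total]


# Toy static embeddings and unigram probabilities (made-up values).
toy_embeddings = {"the": [1.0, 0.0], "cat": [0.0, 1.0]}
toy_probs = {"the": 0.05, "cat": 0.0001}

sif_vec = sif_weighted_average(["the", "cat"], toy_embeddings, toy_probs)
```

In this toy case the rare word "cat" dominates the result, which is the intended effect of the weighting.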
However, unlike supervised tasks, universal sentence embedding methods in general do not have a clear objective function to optimize.
Universal sentence embedding methods have no clear objective to train and optimize against.
IV. PROPOSED SBERT-WK METHOD
We propose a new sentence embedding method called SBERT-WK in this section.
1) Determine a unified word representation for each word in a sentence by integrating its representations across layers, based on their alignment and novelty properties.
2) Conduct a weighted average of unified word representations based on the word importance measure to yield the ultimate sentence embedding vector.
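The two steps above can be sketched as a two-stage weighted pooling over per-layer hidden states. This is a simplified illustration, not the paper's exact formulas: the novelty score (norm of the part of a layer vector that is orthogonal to the mean of its neighbouring layers) and the importance score (variance of cosine similarity between adjacent layers) are assumed stand-ins, the alignment term is left out, and all function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def cosine(u: Vector, v: Vector) -> float:
    denom = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / denom if denom > 0 else 0.0


def normalize(weights: List[float]) -> List[float]:
    """Scale weights to sum to 1; fall back to uniform weights if all are zero."""
    total = sum(weights)
    if total <= 0:
        return [1.0 / len(weights)] * len(weights)
    return [w / total for w in weights]


def novelty_scores(layer_vecs: List[Vector], window: int = 1) -> List[float]:
    """Assumed proxy: residual norm after projecting onto the neighbouring-layer mean."""
    if len(layer_vecs) < 2:
        raise ValueError("need at least two layers")
    scores = []
    for i, vec in enumerate(layer_vecs):
        lo, hi = max(0, i - window), min(len(layer_vecs), i + window + 1)
        neighbours = [layer_vecs[j] for j in range(lo, hi) if j != i]
        mean = [sum(col) / len(neighbours) for col in zip(*neighbours)]
        mm = dot(mean, mean)
        scale = dot(vec, mean) / mm if mm > 0 else 0.0
        residual = [x - scale * m for x, m in zip(vec, mean)]
        scores.append(math.sqrt(dot(residual, residual)))
    return scores


def unified_word_vector(layer_vecs: List[Vector], window: int = 1) -> Vector:
    """Step 1: combine one word's vectors across layers into a single vector."""
    weights = normalize(novelty_scores(layer_vecs, window))
    dim = len(layer_vecs[0])
    return [sum(w * vec[i] for w, vec in zip(weights, layer_vecs)) for i in range(dim)]


def word_importance(layer_vecs: List[Vector]) -> float:
    """Assumed proxy: variance of cosine similarity between adjacent layers."""
    sims = [cosine(layer_vecs[i], layer_vecs[i + 1]) for i in range(len(layer_vecs) - 1)]
    mean = sum(sims) / len(sims)
    return sum((s - mean) ** 2 for s in sims) / len(sims)


def sentence_embedding(hidden: List[List[Vector]], start_layer: int = 0, window: int = 1) -> Vector:
    """Step 2: importance-weighted average of unified word vectors.

    hidden[layer][token] is the vector of one token at one layer.
    """
    layers = hidden[start_layer:]
    per_token = [[layer[t] for layer in layers] for t in range(len(layers[0]))]
    unified = [unified_word_vector(vecs, window) for vecs in per_token]
    weights = normalize([word_importance(vecs) for vecs in per_token])
    dim = len(unified[0])
    return [sum(w * vec[i] for w, vec in zip(weights, unified)) for i in range(dim)]


# Toy input: 3 layers, 2 tokens, dimension 2. Token 0 never changes across layers.
toy_hidden = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[1.0, 0.0], [1.0, 0.0]],
    [[1.0, 0.0], [0.0, 1.0]],
]
sentence_vec = sentence_embedding(toy_hidden)
```

In the toy input, the token whose representation changes across layers receives all of the sentence-level weight, while the unchanging token receives none. The `start_layer` and `window` arguments mirror the starting-layer and context-window hyperparameters named in these notes, but their defaults here are arbitrary.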
V. EXPERIMENTS
• Semantic textual similarity tasks.
They predict the similarity between two given sentences. They can be used to indicate the embedding ability of a method in terms of clustering and information retrieval via semantic search.
• Supervised downstream tasks.
They measure an embedding's transfer capability to downstream tasks, including entailment and sentiment classification.
• Probing tasks.
They have been proposed in recent years to measure the linguistic features of an embedding model and provide fine-grained analysis.
For performance benchmarking, we compare SBERT-WK with the following 10 different methods, including parameterized and non-parameterized models:
1) Average of GloVe word embeddings;
2) Average of the last-layer token representations of BERT;
3) Use [CLS] embedding from BERT, where [CLS] is used for next sentence prediction in BERT;
4) SIF model [21], which is a non-parameterized model that provides a strong baseline in textual similarity tasks;
5) GEM model [23], which is a non-parameterized model deriving from the analysis of static word embedding space;
6) p-mean model [29] that incorporates multiple word embedding models;
7) Skip-Thought [24];
8) InferSent [25] with both GloVe and FastText versions;
9) Universal Sentence Encoder [30], which is a strong parameterized sentence embedding using multiple objectives and transformer architecture;
10) SBERT, which is a state-of-the-art sentence embedding model obtained by training a Siamese network over BERT.
Ten methods are used for comparison.
A. Semantic Textual Similarity
To evaluate semantic textual similarity, we use 2012-2016 STS datasets [31]–[35].
They contain sentence pairs and labels between 0 and 5, which indicate their semantic relatedness. Some methods learn a complex regression model that maps sentence pairs to their similarity score.
In our experiments, we do not include the representation from the first three layers since their representations are less contextualized as reported in [20]. Some superficial information is captured by those representations and they play a subsidiary role in most tasks [8].
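A common way to score embeddings on STS data without a learned regression model is to take the cosine similarity of each sentence pair and correlate it with the gold labels. The sketch below assumes that protocol (cosine similarity plus Pearson correlation), which is not spelled out in these notes, and uses made-up vectors and labels.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    denom = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / denom if denom > 0 else 0.0


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def sts_score(pairs: List[Tuple[Vector, Vector]], gold: List[float]) -> float:
    """Correlate cosine similarities of embedded pairs with gold labels in [0, 5]."""
    predicted = [cosine(a, b) for a, b in pairs]
    return pearson(predicted, gold)


# Toy sentence embeddings for three pairs and their gold relatedness labels.
toy_pairs = [
    ([1.0, 0.0], [1.0, 0.0]),  # identical direction
    ([1.0, 0.0], [1.0, 1.0]),  # partly related
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal
]
toy_gold = [5.0, 3.0, 0.0]

score = sts_score(toy_pairs, toy_gold)
```

Because correlation is invariant to scale and shift, the cosine values do not need to be mapped onto the 0 to 5 label range.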
B. Supervised Downstream Tasks
For supervised tasks, we compare SBERT-WK with other sentence embedding methods in the following eight downstream tasks.
Eight downstream tasks are used for testing.
• MR: Binary sentiment prediction on movie reviews [39].
• CR: Binary sentiment prediction on customer product reviews [40].
• SUBJ: Binary subjectivity prediction on movie reviews and plot summaries [41].
• MPQA: Phrase-level opinion polarity classification [42].
• SST2: Stanford Sentiment Treebank with binary labels [43].
• TREC: Question type classification with 6 classes [44].
• MRPC: Microsoft Research Paraphrase Corpus for paraphrase prediction [45].
• SICK-E: Natural language inference dataset [36].
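For binary tasks such as those listed above, transfer is typically measured by freezing the sentence embeddings and training a small classifier on top. The sketch below assumes a plain logistic regression trained with batch gradient descent. The classifier choice, the hyperparameters, and the toy embeddings are assumptions for illustration, not details from these notes.

```python
import math
from typing import List, Tuple

Vector = List[float]


def sigmoid(z: float) -> float:
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logreg(
    X: List[Vector], y: List[int], lr: float = 0.5, epochs: int = 200
) -> Tuple[Vector, float]:
    """Fit weights and bias on frozen embeddings with batch gradient descent."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, label in zip(X, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - label
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w: Vector, b: float, x: Vector) -> int:
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else 0


# Toy frozen sentence embeddings with binary labels (e.g. positive / negative).
train_X = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
train_y = [1, 1, 0, 0]

weights, bias = train_logreg(train_X, train_y)
predictions = [predict(weights, bias, x) for x in train_X]
```

The embedding model itself is never updated in this setup, so accuracy reflects how linearly separable the task is in the embedding space.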
C. Probing Tasks
This could be attributed to the fact that SBERT pays more attention to sentence-level information in its training objective. It focuses more on sentence-pair similarities.
In contrast, the masked language modeling objective in BERT focuses more on word- or phrase-level information, and the next sentence prediction objective captures inter-sentence information.
Probing tasks test word-level information or the inner structure of a sentence.
D. Ablation and Sensitivity Study
To verify the effectiveness of each module in the proposed SBERT-WK model, we conduct an ablation study by adding one module at a time. The effect of two hyperparameters (the context window size and the starting layer selection) is also evaluated. The averaged results for the semantic textual similarity datasets, including STS12-STS16 and STSB, are presented.
Inference time comparison of InferSent, BERT, XLNet, SBERT and SBERT-WK. Data are collected from 5 trials.
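A timing comparison of this kind can be reproduced with a small harness that repeats the encoding call and averages the wall-clock time. The sketch below is one plausible way to do it, not the paper's measurement setup. `time_encoder` and `dummy_encode` are hypothetical names, and `dummy_encode` stands in for a real model.

```python
import time
from statistics import mean
from typing import Callable, List


def time_encoder(
    encode: Callable[[List[str]], object], sentences: List[str], trials: int = 5
) -> float:
    """Return the mean wall-clock seconds of encode(sentences) over `trials` runs."""
    durations = []
    for _ in range(trials):
        start = time.perf_counter()
        encode(sentences)
        durations.append(time.perf_counter() - start)
    return mean(durations)


def dummy_encode(sentences: List[str]) -> List[List[float]]:
    # Hypothetical stand-in for a real model: a one-number "embedding" per sentence.
    return [[float(len(s))] for s in sentences]


avg_seconds = time_encoder(dummy_encode, ["a short sentence", "another one"])
```

With a real model, a warm-up call before timing and a fixed batch of sentences for every method help keep the comparison fair.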
VI. CONCLUSION AND FUTURE WORK
In this work, we provided an in-depth study of the evolving pattern of word representations across layers in deep contextualized models. Furthermore, we proposed a novel sentence embedding model, called SBERT-WK, by dissecting deep contextualized models and leveraging the diverse information learned in different layers for effective sentence representations. SBERT-WK is efficient, and it demands no further training. Evaluation was conducted on a wide range of tasks to show the effectiveness of SBERT-WK.