Revisiting Open Domain Query Facet Extraction and Generation
Revisits the task of query facet extraction and generation and studies various formulations of this task.
Faspect
: a library that includes various implementations of the facet extraction and generation methods in this paper. We focus on extracting and generating facets from the search engine result page (SERP) for a given query.
training set: a set of triples, each containing a query $q_i$, the list of documents $D_i$ from the SERP, and the set of gold facets $F_i$.
The task is to train a model that returns an accurate list of facets for a given query.
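As a concrete illustration, a single training instance might look like the following sketch (field names and values are hypothetical, not the dataset's actual schema):

```python
# A single (hypothetical) training instance: a query, its SERP documents,
# and the gold facets. Field names are illustrative only.
instance = {
    "query": "mens shoes",
    "documents": [
        "Shop running shoes for men ...",         # SERP snippet 1
        "Leather dress shoes, free returns ...",  # SERP snippet 2
    ],
    "facets": ["running shoes", "dress shoes", "casual shoes"],
}
```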
We can cast the facet extraction problem as a sequence labeling task.
Our model $M_{\theta_{ext}}$ classifies each document token as B, I, or O. We use RoBERTa and apply an MLP with an output dimensionality of three to each token representation.
input: [CLS] query tokens [SEP] doc tokens [SEP]
objective: token-level cross-entropy over the B/I/O labels,
$\mathcal{L}_{ext} = -\sum_{x} \log p(l_x \mid q_i, D_i)$
where $l_x$ is the gold label of the $x^{th}$ token and $p$ can be computed by applying a softmax operator to the model's output for the $x^{th}$ token.
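A minimal sketch of this extraction model, assuming the Hugging Face transformers library with a roberta-base checkpoint (not the authors' exact implementation):

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hypothetical sketch: RoBERTa with a 3-way token classification head (B, I, O).
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForTokenClassification.from_pretrained("roberta-base", num_labels=3)

query = "mens shoes"
doc = "Shop running shoes and dress shoes for men."

# Encode the query and document as a pair; RoBERTa inserts its own special
# tokens, playing the role of [CLS]/[SEP] in the input template above.
inputs = tokenizer(query, doc, return_tensors="pt", truncation=True)

with torch.no_grad():
    logits = model(**inputs).logits      # shape: (1, seq_len, 3)
probs = torch.softmax(logits, dim=-1)    # per-token B/I/O probabilities
labels = probs.argmax(dim=-1)            # predicted tag id per token
```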
We perform facet generation using an autoregressive text generation model.
For every query $q_i$, we concatenate the facets in $F_i$ using a separator token to form the target sequence $y_i$.
The model is BART (a Transformer-based encoder-decoder model for text generation), and we use two variations:
variation 1: only takes the query tokens as input and generates the facets
variation 2: takes the query tokens and the document tokens for all documents in the SERP (separated by [SEP]) as input and generates the facet tokens one by one
objective: standard sequence-to-sequence cross-entropy over the target facet sequence $y_i$, i.e., maximize $\sum_{t} \log p(y_{i,t} \mid y_{i,<t}, x_i)$, where $x_i$ is the model input (the query, optionally followed by the SERP documents).
inference: perform autoregressive text generation with beam search and sampling, conditioning the probability of each next token on the previously generated tokens
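A rough sketch of the query-only variation, assuming facebook/bart-base and a comma-separated target sequence (the actual separator token used in the paper is an assumption here):

```python
from transformers import BartTokenizer, BartForConditionalGeneration

# Hypothetical sketch of the query-only facet generation variation with BART.
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# Training target: the facets of a query joined by a separator (", " assumed).
facets = ["running shoes", "dress shoes", "casual shoes"]
target_text = ", ".join(facets)

query = "mens shoes"
inputs = tokenizer(query, return_tensors="pt")
labels = tokenizer(target_text, return_tensors="pt").input_ids

# Standard seq2seq cross-entropy loss over the target facet sequence.
loss = model(input_ids=inputs.input_ids,
             attention_mask=inputs.attention_mask,
             labels=labels).loss

# Inference: autoregressive generation with beam search.
generated = model.generate(inputs.input_ids, num_beams=4, max_length=64)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```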
We treat the facet generation task as an extreme multi-label text classification problem.
The model is RoBERTa, $M_{\theta_{mcl}}$.
We get the probability of every facet by applying a linear transformation to the representation of the [CLS] token, followed by a sigmoid activation.
objective (binary cross-entropy):
$\mathcal{L}_{mcl} = -\sum_{j} \left[ y_{i,j} \log y'_{i,j} + (1 - y_{i,j}) \log\left(1 - y'_{i,j}\right) \right]$
where $y'_{i,j}$ is the probability of relevance of the facet $f_j$ given the query $q_i$ and the list of documents $D_i$; it can be computed by applying a sigmoid operator to the model's output for the $j^{th}$ facet class.
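A minimal sketch, assuming roberta-base and a fixed facet label space of NUM_FACETS frequent facets (the size and construction of the label space are assumptions):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

NUM_FACETS = 5000  # assumed size of the facet label space

# Hypothetical sketch: multi-label classification over a fixed facet vocabulary.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base",
    num_labels=NUM_FACETS,
    problem_type="multi_label_classification",  # uses BCE-with-logits loss
)

query = "mens shoes"
docs = "Shop running shoes ... Leather dress shoes ..."
inputs = tokenizer(query, docs, return_tensors="pt", truncation=True)

with torch.no_grad():
    logits = model(**inputs).logits   # shape: (1, NUM_FACETS)
probs = torch.sigmoid(logits)         # y'_{i,j}: relevance probability per facet
top = torch.topk(probs, k=5)          # keep the 5 highest-scoring facets
```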
We investigate the few-shot effectiveness of large-scale pre-trained autoregressive language models.
model: GPT-3
generate facets using a task description followed by a small number of examples (the prompt)
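A sketch of such a prompt, using the legacy (pre-1.0) OpenAI Completion API; the model name, example queries, and prompt wording are placeholders, not the paper's actual prompt:

```python
import openai  # legacy (<1.0) openai package assumed

# Few-shot prompt: a task description followed by a handful of examples.
prompt = (
    "Generate facets for the given web search query.\n\n"
    "query: mens shoes\n"
    "facets: running shoes, dress shoes, casual shoes\n\n"
    "query: jaguar\n"
    "facets: jaguar the animal, jaguar cars, jacksonville jaguars\n\n"
    "query: python\n"
    "facets:"
)

response = openai.Completion.create(
    model="text-davinci-003",  # placeholder GPT-3 model name
    prompt=prompt,
    max_tokens=64,
    temperature=0.7,
)
print(response["choices"][0]["text"].strip())
```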
Use hand-crafted rules to extract candidate facets from the SERP and then re-rank them.
We explore three aggregation methods: Learning to Rank, MMR diversification, Round Robin Diversification
Facet Relevance Ranking:
use a bi-encoder model to assign a score to each candidate facet for each query and re-rank them based on their score in descending order
score: the dot product of the query and facet representations: $sim(q, f) = E(q) \cdot E(f)$
$E$: the average token embedding from a BERT model pre-trained on multiple text similarity tasks. To find the optimal parameters, we minimize a cross-entropy loss for every positive query-facet pair $(q_i, f_i^+)$ in the MIMICS dataset.
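A minimal sketch of the scoring and re-ranking step, assuming sentence-transformers as the encoder $E$ (the specific checkpoint is a placeholder):

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical encoder E: a BERT-style model trained on text similarity tasks.
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder checkpoint

query = "mens shoes"
candidate_facets = ["running shoes", "dress shoes", "shoe size chart", "recipes"]

q_emb = encoder.encode(query, convert_to_tensor=True)
f_emb = encoder.encode(candidate_facets, convert_to_tensor=True)

# sim(q, f) = E(q) . E(f): dot product between query and facet embeddings.
scores = util.dot_score(q_emb, f_emb)[0]

# Re-rank candidate facets by descending score.
ranked = sorted(zip(candidate_facets, scores.tolist()),
                key=lambda x: x[1], reverse=True)
print(ranked)
```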
MMR diversification: iteratively select the facet that maximizes a trade-off between relevance to the query and dissimilarity to the already-selected facets (Maximal Marginal Relevance).
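A generic MMR sketch over scored candidate facets; the trade-off parameter and the use of embedding dot products for redundancy are assumptions, not the paper's exact formulation:

```python
import numpy as np

def mmr_select(facets, relevance, facet_embs, k=5, lam=0.7):
    """Greedy Maximal Marginal Relevance over candidate facets.

    facets: list of facet strings
    relevance: array of relevance scores sim(q, f)
    facet_embs: matrix of (unit-normalized) facet embeddings
    """
    selected, remaining = [], list(range(len(facets)))
    while remaining and len(selected) < k:
        best, best_score = None, -np.inf
        for i in remaining:
            # Redundancy: max similarity to any already-selected facet.
            redundancy = max(
                (float(facet_embs[i] @ facet_embs[j]) for j in selected),
                default=0.0,
            )
            score = lam * relevance[i] - (1 - lam) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        remaining.remove(best)
    return [facets[i] for i in selected]
```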
Round Robin Diversification: alternate over the ranked facet lists produced by the different models, taking the top remaining facet from each list in turn until the desired number of facets is reached.
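A simple round-robin interleaving sketch over per-model ranked lists, with duplicates skipped (the duplicate handling is an assumption):

```python
def round_robin(ranked_lists, k=5):
    """Interleave facets from each model's ranked list, skipping duplicates."""
    result, seen = [], set()
    iterators = [iter(lst) for lst in ranked_lists]
    while len(result) < k and iterators:
        for it in list(iterators):
            facet = next(it, None)
            if facet is None:
                iterators.remove(it)  # this list is exhausted
                continue
            if facet not in seen:
                seen.add(facet)
                result.append(facet)
                if len(result) == k:
                    break
    return result

# Example: lists from the extraction, generation, and classification models.
print(round_robin([
    ["running shoes", "dress shoes"],
    ["shoe brands", "running shoes", "shoe sizes"],
    ["casual shoes"],
], k=5))
```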
MIMICS
: contains web search queries sampled from the Bing query logs, and for each query, it provides up to 5 facets and the returned result snippets.
MIMICS-Click: queries paired with clarification panes (facets) and aggregated user click signals.
MIMICS-Manual: queries whose clarification panes were manually labeled for quality by human annotators.