[Paper Reading, CIKM 2011] Finding Dimensions for Queries

Table of Contents

    • Foreword
    • Abs
    • Method
      • List Extraction
      • List Weighting
      • List Clustering
      • Dimension and Item Ranking

Foreword

  • This paper is from CIKM 2011, so we only consider the method, not the results
  • There are many papers that have not been shared. More papers can be found in: ShiyuNee/Awesome-Conversation-Clarifying-Questions-for-Information-Retrieval: Papers about Conversation and Clarifying Questions (github.com)

Abs

We address the problem of finding multiple groups of words or phrases that explain the underlying query facets, which we refer to as query dimensions. We assume that the important aspects of a query are usually presented and repeated in the query’s top retrieved documents in the style of lists, and query dimensions can be mined out by aggregating these significant lists.

We propose aggregating frequent lists within the top search results to mine query dimensions, and implement a system called QDMiner.

Method

QDMiner discovers query dimensions by aggregating frequent lists within the top results.

  • Important information is usually organized in list formats by websites
    • Listing is a graceful way to show parallel knowledge or items
  • Important lists are commonly supported by relevant websites and hence repeat in the top search results, whereas unimportant lists appear only infrequently in the results.

Query dimensions are mined by the following four steps:

  • List Extraction: Several types of lists are extracted from each document
  • List Weighting: All extracted lists are weighted, so that unimportant or noisy lists are assigned low weights
  • List Clustering: Similar lists are grouped together to compose a dimension.
  • Dimension and Item Ranking: Dimensions (relative to one another) and their items (within a dimension) are evaluated and ranked based on their importance.

List Extraction

For each document $d$, we extract a set of lists from the HTML content of $d$ based on three different types of patterns:

  • Free text patterns:

    • pattern: item{, item}*(and|or) {other} item

      Example 1 We shop for gorgeous watches from Seiko, Bulova, Lucien Piccard, Citizen, Cartier or Invicta

    • further use the pattern {^item (: |-) .+$}+ to extract lists from some semi-structured paragraphs

      Example 2 … are highly important for following reasons: Consistency - every fact table is filtered consistently res… Integration - queries are able to drill different processes … Reduced development time to market - the common dimensions are available without recreating the wheel over again.

  • HTML tag patterns:

    • style of HTML tags
      • SELECT: extract all text from their child tags (OPTION) to create a list
      • UL / OL: extract text within their child tags (LI)
      • TABLE: extract one list from each column or each row


  • Repeat region patterns:


    • First detect repeat regions in webpages based on vision-based DOM trees

    • Then extract all leaf HTML nodes within each block, and group them by their tag names(name, rating, etc) and display styles.

    • Last, for each group, extract all text from its nodes as a list

    Note: every extracted list is post-processed before it is used
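The free text and HTML tag patterns above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the QDMiner implementation: the regular expression assumes that an item is a run of capitalized words (the source does not define what an item is), the names `extract_free_text_lists`, `extract_tag_lists`, and `TagListParser` are hypothetical, and TABLE and repeat region extraction are left out.

```python
import re
from html.parser import HTMLParser

# Assumption: an "item" is one or more capitalized words (e.g. "Lucien Piccard").
ITEM = r"[A-Z][\w'&-]*(?: [A-Z][\w'&-]*)*"
# Free text pattern: item{, item}* (and|or) {other} item
FREE_TEXT = re.compile(rf"\b({ITEM}(?:, {ITEM})*),? (?:and|or) (?:other )?({ITEM})")


def extract_free_text_lists(text):
    """Return one list of items per match of the free text pattern."""
    lists = []
    for match in FREE_TEXT.finditer(text):
        lists.append(match.group(1).split(", ") + [match.group(2)])
    return lists


class TagListParser(HTMLParser):
    """Builds one list per UL/OL/SELECT from the text of its LI/OPTION children."""

    CONTAINERS = {"ul", "ol", "select"}
    ITEMS = {"li", "option"}

    def __init__(self):
        super().__init__()
        self.lists = []    # finished lists
        self._stack = []   # lists of the currently open containers
        self._buf = None   # text pieces of the item being read

    def _flush(self):
        # Close the current item (if any) and store its normalized text.
        if self._buf is not None and self._stack:
            text = " ".join("".join(self._buf).split())
            if text:
                self._stack[-1].append(text)
        self._buf = None

    def handle_starttag(self, tag, attrs):
        if tag in self.CONTAINERS:
            self._flush()
            self._stack.append([])
        elif tag in self.ITEMS and self._stack:
            self._flush()  # also handles OPTION tags without a closing tag
            self._buf = []

    def handle_endtag(self, tag):
        if tag in self.ITEMS:
            self._flush()
        elif tag in self.CONTAINERS and self._stack:
            self._flush()
            items = self._stack.pop()
            if items:
                self.lists.append(items)

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)


def extract_tag_lists(html):
    parser = TagListParser()
    parser.feed(html)
    parser.close()
    return parser.lists


sentence = ("We shop for gorgeous watches from Seiko, Bulova, "
            "Lucien Piccard, Citizen, Cartier or Invicta")
print(extract_free_text_lists(sentence))
print(extract_tag_lists("<ul><li>Seiko</li><li>Casio</li></ul>"))
```

The capitalization assumption is what keeps the first item from swallowing the whole sentence prefix; a real extractor would need a more careful notion of item boundaries.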

List Weighting

[Figure not recovered: examples of extracted lists that are not useful]

Lists of this type are useless for finding dimensions, so we should penalize them.

We aggregate all lists of a query and evaluate the importance of each unique list $l$ by the following components:

  • document matching weight: $S_{\mathrm{DOC}}=\sum_{d \in R}\left(s_d^m \cdot s_d^r\right)$, where $R$ is the set of top retrieved documents

    • $s_d^m$ is the percentage of items of $l$ contained in $d$
      • $s_d^m=\frac{|d \cap l|}{|l|}$
    • $s_d^r$ measures the importance of document $d$
      • $s_d^r=1 / \sqrt{\mathrm{rank}_d}$
      • The higher $d$ is ranked, the larger its score $s_d^r$ is (a higher-ranked $d$ is more relevant to the query)
  • average inverse document frequency (IDF) of items: $S_{\mathrm{IDF}}=\frac{1}{|l|}\sum_{e \in l} \mathrm{idf}_e$

    • A list composed of items that are common in a corpus (we use ClueWeb09) is not informative to the query.

The importance of a list $l$: $S_l = S_{\mathrm{DOC}} \cdot S_{\mathrm{IDF}}$
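As a rough illustration of the list weight, the sketch below computes the document matching weight, the average IDF, and their product. It assumes that each document is already reduced to the set of items it contains and that per-item IDF values are available in a dictionary; both representations, and all function names, are hypothetical.

```python
import math


def doc_matching_weight(list_items, ranked_docs):
    """S_DOC: sum over documents of (share of list items in d) * 1/sqrt(rank of d).

    ranked_docs is an iterable of (rank, items_in_doc) pairs with rank >= 1.
    """
    items = set(list_items)
    total = 0.0
    for rank, doc_items in ranked_docs:
        s_m = len(items & set(doc_items)) / len(items)  # s_d^m
        s_r = 1.0 / math.sqrt(rank)                     # s_d^r
        total += s_m * s_r
    return total


def average_idf(list_items, idf):
    """S_IDF: mean IDF of the unique items in the list (idf is a precomputed dict)."""
    items = set(list_items)
    return sum(idf[e] for e in items) / len(items)


def list_weight(list_items, ranked_docs, idf):
    """S_l = S_DOC * S_IDF."""
    return doc_matching_weight(list_items, ranked_docs) * average_idf(list_items, idf)


docs = [(1, {"seiko", "casio"}), (4, {"seiko"})]
idf = {"seiko": 2.0, "casio": 4.0}  # made-up values for illustration
print(list_weight(["seiko", "casio"], docs, idf))
```

In this toy input the first document contains both items and the fourth contains one, so the rank-4 document contributes half of the items at half the rank weight.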

List Clustering

Two lists can be grouped together if they share enough items.

  • Distance between two lists: $d_l\left(l_1, l_2\right)=1-\frac{\left|l_1 \cap l_2\right|}{\min \left\{\left|l_1\right|,\left|l_2\right|\right\}}$
  • Distance between two clusters (complete linkage): $d_c\left(c_1, c_2\right)=\max _{l_1 \in c_1, l_2 \in c_2} d_l\left(l_1, l_2\right)$
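The two distances translate directly into code. The sketch below treats each list as a set of items and each cluster as a collection of lists; the function names are illustrative, and lists are assumed to be non-empty.

```python
def list_distance(l1, l2):
    """d_l: 1 - |l1 ∩ l2| / min(|l1|, |l2|), computed on unique items."""
    s1, s2 = set(l1), set(l2)
    return 1.0 - len(s1 & s2) / min(len(s1), len(s2))


def cluster_distance(c1, c2):
    """d_c: the largest list distance between any list in c1 and any list in c2."""
    return max(list_distance(l1, l2) for l1 in c1 for l2 in c2)


print(list_distance(["a", "b", "c"], ["b", "c", "d", "e"]))
print(cluster_distance([["a", "b"]], [["a", "b", "c"], ["x", "y"]]))
```

Dividing by the size of the smaller list means a short list fully contained in a longer one has distance 0 to it.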

We use a modified QT (Quality Threshold) clustering algorithm to group similar lists. The original QT algorithm assumes that all data points are equally important.

We modify the original QT algorithm to group highly weighted lists first, and refer to the result as WQT (Quality Threshold with Weighted data points).

Individual weighted lists are not used directly as query dimensions.
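One way to read "group highly weighted lists first" is a greedy loop that seeds each cluster with the heaviest unclustered list and grows it while the cluster diameter stays under a threshold. The sketch below follows that reading; the `max_diameter` parameter, the growth rule, and the function names are assumptions made for illustration, not details given in these notes.

```python
def list_distance(l1, l2):
    """1 - |l1 ∩ l2| / min(|l1|, |l2|), computed on unique items."""
    s1, s2 = set(l1), set(l2)
    return 1.0 - len(s1 & s2) / min(len(s1), len(s2))


def wqt_cluster(weighted_lists, max_diameter):
    """Greedy weighted QT-style clustering (sketch).

    weighted_lists: iterable of (weight, items) pairs.
    max_diameter:   assumed upper bound on the distance between any two lists
                    in the same cluster.
    Returns a list of clusters, each a list of (weight, items) pairs.
    """
    # Process the most important lists first.
    remaining = sorted(weighted_lists, key=lambda wl: wl[0], reverse=True)
    clusters = []
    while remaining:
        cluster = [remaining.pop(0)]  # seed: heaviest unclustered list
        while True:
            best_i, best_d = None, None
            for i, (_, items) in enumerate(remaining):
                # Complete linkage: distance to the farthest cluster member.
                d = max(list_distance(items, member) for _, member in cluster)
                if d <= max_diameter and (best_d is None or d < best_d):
                    best_i, best_d = i, d
            if best_i is None:
                break  # no remaining list fits within the diameter
            cluster.append(remaining.pop(best_i))
        clusters.append(cluster)
    return clusters


data = [
    (3.0, ["a", "b", "c"]),
    (2.0, ["a", "b"]),
    (1.0, ["x", "y"]),
    (0.5, ["x", "y", "z"]),
]
for cluster in wqt_cluster(data, max_diameter=0.5):
    print([items for _, items in cluster])
```

Because a list is only added when its distance to every current member is within the threshold, the cluster diameter never exceeds `max_diameter`.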

Dimension and Item Ranking

A good dimension should appear frequently in the top results. A dimension $c$ is more important if:

  • (1) the lists in $c$ are extracted from more unique websites
  • (2) the lists in $c$ are more important, i.e., they have higher weights.

[Equation image not recovered: the importance score of dimension $c$]

  • $S_l$ is the weight of a list $l$
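The scoring equation itself is not reproduced in the text here, so the sketch below only assumes one scoring rule that is consistent with the two criteria: each website contributes the weight of its best list in the dimension, and the contributions are summed. Both the rule and the name `dimension_score` are assumptions for illustration.

```python
def dimension_score(cluster):
    """Assumed dimension score: sum over websites of the best list weight per site.

    cluster: iterable of (site, list_weight) pairs, one per list in the dimension.
    """
    best_per_site = {}
    for site, weight in cluster:
        # Keep only the heaviest list from each website.
        best_per_site[site] = max(weight, best_per_site.get(site, 0.0))
    return sum(best_per_site.values())


lists_in_dimension = [("a.com", 2.0), ("a.com", 3.0), ("b.com", 1.5)]
print(dimension_score(lists_in_dimension))
```

Under this rule, more unique websites add more terms to the sum (criterion 1), and heavier lists make each term larger (criterion 2), while many near-duplicate lists from one site do not inflate the score.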

In a dimension, the importance of an item depends on how many lists contain the item and its ranks in the lists.

[Equation image not recovered: the item score $S_{e|c}$]

  • $e$ is an item
  • $w(c,e,s)$ is the weight contributed by a website $s$
  • $AvgRank_{c,e,s}$ is the average rank of $e$ within all lists extracted from website $s$.

We only output qualified items by default in QDMiner.

  • qualified items: $S_{e|c} > 1$ and $S_{e|c} > \frac{|Sites(c)|}{10}$
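The item ranking and the qualification filter can be sketched as follows. The exact item score formula is not reproduced in these notes, so the code assumes that $S_{e|c}$ is the sum over websites of $w(c,e,s)$ and that $w(c,e,s) = 1/\sqrt{AvgRank_{c,e,s}}$, by analogy with the document rank weight; both are assumptions, as are the function names.

```python
import math
from collections import defaultdict


def item_scores(cluster):
    """Assumed S_{e|c}: sum over sites of 1 / sqrt(average rank of e on that site).

    cluster: iterable of (site, items) pairs, items in their original list order.
    Returns (scores, number_of_sites).
    """
    ranks = defaultdict(lambda: defaultdict(list))  # item -> site -> ranks
    sites = set()
    for site, items in cluster:
        sites.add(site)
        for rank, item in enumerate(items, start=1):
            ranks[item][site].append(rank)
    scores = {}
    for item, by_site in ranks.items():
        scores[item] = sum(
            1.0 / math.sqrt(sum(r) / len(r)) for r in by_site.values()
        )
    return scores, len(sites)


def qualified_items(cluster):
    """Items with S_{e|c} > 1 and S_{e|c} > |Sites(c)| / 10, best first."""
    scores, n_sites = item_scores(cluster)
    keep = {e: s for e, s in scores.items() if s > 1 and s > n_sites / 10}
    return sorted(keep, key=keep.get, reverse=True)


dimension = [
    ("a.com", ["seiko", "casio"]),
    ("b.com", ["seiko", "omega"]),
    ("c.com", ["casio", "seiko"]),
]
print(qualified_items(dimension))
```

With this weighting a single website can contribute at most 1 to an item, so the threshold $S_{e|c} > 1$ effectively requires support from more than one website.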
