Notes from an Information Retrieval Talk

Preface

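I recently attended a talk on IR by Maarten, a well-known figure in the field. If I remember correctly, it was the same talk I heard at ESSIR last year, but I get something new out of it every time, so I have organized my notes below.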

Query Improvement (online)

  1. Main goals: give users shortcuts and handle errors in queries
  2. Main method: log analysis (AOL dataset)
  3. Main approaches:
    • Query Auto-Completion (QAC): predict the intent the user has in mind but has not yet clearly expressed
    • Query Suggestion: recommendation, ranking & diversity
    • Query Expansion
    • Query Correction
  4. The key is to combine query signals (clicks, time, news, personal, general, location, etc.) with query logs
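
As a rough illustration of log-based query auto-completion, the sketch below ranks completions for a prefix by how often each full query appears in a log. The toy log and the function names `build_query_counts` and `suggest` are made up for this example and do not come from the talk; a real system would presumably also blend in the signals listed above (clicks, time, location, and so on).

```python
from collections import Counter

def build_query_counts(log):
    """Count how often each normalized query string appears in the log."""
    return Counter(q.strip().lower() for q in log if q.strip())

def suggest(prefix, counts, k=3):
    """Return up to k logged queries that start with prefix, most frequent first."""
    prefix = prefix.strip().lower()
    matches = [(q, n) for q, n in counts.items() if q.startswith(prefix)]
    # Sort by descending frequency, then alphabetically so ties are stable.
    matches.sort(key=lambda item: (-item[1], item[0]))
    return [q for q, _ in matches[:k]]

# Hypothetical toy log; a real log such as AOL would hold millions of entries.
toy_log = [
    "information retrieval",
    "information retrieval",
    "Information Retrieval ",
    "information theory",
    "inverted index",
]
counts = build_query_counts(toy_log)
print(suggest("inf", counts))  # ['information retrieval', 'information theory']
```

Frequency alone is the simplest possible ranking signal; it is used here only because it needs nothing beyond the log itself.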

Getting Content (offline)

  1. Common problems in crawling:
    • Scale
    • Content selection
    • URL filtering
    • Remove duplicate URLs: exact and near duplicates (compare word sequences, e.g. word n-grams)
    • Spam detection: meaningful expressions, sentiment analysis & supervised learning
    • Aggregation: consider anchor text on the web & information across entities
    • Inverted index construction: collect -> tokenize -> stopwords -> stem/lemma -> index
    • Temporal IR: info can be images, songs, books, news, web pages, videos and apps
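
Two of the items above lend themselves to a small sketch: near-duplicate detection by comparing word n-grams, and the collect -> tokenize -> stopwords -> stem/lemma -> index pipeline. The code below is only an illustration under stated assumptions: the stopword list is a tiny made-up sample, `crude_stem` is a naive suffix stripper standing in for a real stemmer or lemmatizer, and Jaccard similarity over word shingles is one common choice, not necessarily the one discussed in the talk.

```python
import re
from collections import defaultdict

STOPWORDS = {"a", "an", "the", "of", "and", "is", "in", "to"}  # tiny illustrative list

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def crude_stem(token):
    """Naive suffix stripping; a placeholder for a real stemmer or lemmatizer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def shingles(text, n=3):
    """Return the set of word n-grams (shingles) in the text."""
    words = tokenize(text)
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def build_inverted_index(docs):
    """Map each processed term to the sorted list of doc ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():          # collect
        for token in tokenize(text):           # tokenize
            if token in STOPWORDS:             # drop stopwords
                continue
            index[crude_stem(token)].add(doc_id)  # stem, then index
    return {term: sorted(ids) for term, ids in index.items()}

# Hypothetical two-document collection.
docs = {
    1: "The crawler is crawling pages",
    2: "A crawler crawled the pages of a site",
}
index = build_inverted_index(docs)
print(index["crawl"])  # [1, 2]
print(jaccard(shingles("a b c d e"), shingles("a b c d f")))  # 0.5
```

Two pages whose shingle sets have a Jaccard similarity above some chosen threshold would be treated as near duplicates; the threshold itself is a tuning decision not covered in these notes.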

Query Understanding (online)

  1. The result of query understanding can be presented on the search engine results page (SERP). Some context to consider:
    • Search goals? search tasks?
    • Semantic topics?
    • Time-sensitive? location-sensitive?
  2. Classifying queries by pre-defined intent is difficult because queries are short & ambiguous; click-through data & session data can help.
  3. Intent Discovery (Non-predefined)
    • Shifting intents: intents change with time (Radinsky, 2013)
    • Learning to detect intent shifting (Lefortier, 2014)
      • Queries whose intent shifts from non-fresh to fresh
      • More clicks to some links?
  4. Diversity
    • Extrinsic: query with uncertainty
    • Intrinsic: diversity is part of info needs
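
The "more clicks to some links?" cue for intent shifts can be made concrete with a very simple heuristic: compare how clicks for the same query are distributed over URLs in two time windows. The sketch below is an assumption-laden illustration, not the method from the cited work; the function names, the use of total variation distance, and the 0.5 threshold are all choices made for this example.

```python
from collections import Counter

def click_distribution(clicks):
    """Turn a list of clicked URLs into a {url: share of clicks} mapping."""
    counts = Counter(clicks)
    total = sum(counts.values())
    return {url: n / total for url, n in counts.items()} if total else {}

def total_variation(p, q):
    """Total variation distance between two distributions (0 = identical, 1 = disjoint)."""
    urls = set(p) | set(q)
    return 0.5 * sum(abs(p.get(u, 0.0) - q.get(u, 0.0)) for u in urls)

def intent_shifted(old_clicks, new_clicks, threshold=0.5):
    """Flag a shift when the click distribution moved by more than the threshold."""
    distance = total_variation(
        click_distribution(old_clicks), click_distribution(new_clicks)
    )
    return distance > threshold

# Hypothetical click logs for one query in two time windows.
last_month = ["wiki"] * 8 + ["news"] * 2
this_week = ["wiki"] * 2 + ["news"] * 8
print(intent_shifted(last_month, this_week))   # True
print(intent_shifted(last_month, last_month))  # False
```

A sudden move of clicks toward news results, as in the toy data, is the kind of pattern that would suggest a query has turned from non-fresh to fresh.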

Ranker (learning to rank)

  1. content-based
  2. structure-based (title, content, tags, time)
  3. based on interaction behaviors (click through, scanning)
  4. docs represented as feature vectors
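
To make the feature-vector idea concrete, here is a minimal sketch of a linear ranker trained from pairwise preferences with a perceptron-style update. It is one simple instance of learning to rank, chosen for brevity; the three features, the toy documents, and the function names are assumptions for this example and are not taken from the talk.

```python
def score(weights, features):
    """Linear score: dot product of the weight vector and a document's features."""
    return sum(w * x for w, x in zip(weights, features))

def train_pairwise(pairs, n_features, epochs=20, lr=0.1):
    """Perceptron-style pairwise training.

    Each pair is (preferred_features, other_features). Whenever the preferred
    document does not score strictly higher, nudge the weights toward it.
    """
    weights = [0.0] * n_features
    for _ in range(epochs):
        for better, worse in pairs:
            if score(weights, better) <= score(weights, worse):
                weights = [
                    w + lr * (b - x) for w, b, x in zip(weights, better, worse)
                ]
    return weights

def rank(weights, docs):
    """Return doc ids sorted by descending score."""
    return sorted(docs, key=lambda d: score(weights, docs[d]), reverse=True)

# Hypothetical features per document: [content match, title match, click-through rate]
docs = {
    "d1": [0.9, 1.0, 0.30],
    "d2": [0.8, 0.0, 0.05],
    "d3": [0.2, 0.0, 0.01],
}
# Hypothetical preferences: d1 is preferred over d2, and d2 over d3.
pairs = [(docs["d1"], docs["d2"]), (docs["d2"], docs["d3"])]
weights = train_pairwise(pairs, n_features=3)
print(rank(weights, docs))  # ['d1', 'd2', 'd3']
```

The three features loosely mirror the content-based, structure-based, and interaction-based signals listed above; in practice the preference pairs would typically be derived from click-through data or editorial judgments.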

Responsible IR

Privacy, Fairness, Accuracy, Transparency (let the system explain why)
