IR-chapter6: scoring, term weighting and the vector space model

  • motivation: to rank-order the documents matching a query by giving a score to each (query,document) pair

parametric and zone indexes

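  • index and retrieve documents by metadata (fields such as author or date, and zones such as title or abstract).
  • parametric index vs zone index: a parametric index covers a field with a fixed vocabulary (e.g. date of creation); a zone index covers a zone of free text, so its vocabulary is whatever appears in the text of that zone.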
[Figure: parametric search]
[Figure: zone index]
[Figure: zone index]
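  • weighted zone scoring: score(d, q) = sum over zones i of g_i * s_i, where s_i in {0, 1} is the Boolean match of the query against zone i and the weights g_i sum to 1
  • learning weights: instead of setting the g_i by hand, learn them from training examples of the form (query, document, relevance judgment)
  • the optimal weight g: the value that minimizes the total squared error between the score and the judged relevance over the training set
    (a simple machine learning algorithm)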
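
A minimal sketch of weighted zone scoring, assuming three zones (author, title, body) and hand-picked weights. The zone names, the weights and the function name weighted_zone_score are illustrative assumptions, not taken from the notes.

```python
def weighted_zone_score(zone_matches, weights):
    """Return the sum of g_i * s_i over all zones.

    zone_matches: dict mapping zone name -> bool (did the query match there?)
    weights:      dict mapping zone name -> g_i, expected to sum to 1
    """
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("zone weights must sum to 1")
    # s_i is 1 when the query matches zone i, otherwise 0.
    return sum(g for zone, g in weights.items() if zone_matches.get(zone, False))


# Hypothetical weights: a body match counts most, an author match least.
weights = {"author": 0.2, "title": 0.3, "body": 0.5}
score = weighted_zone_score({"title": True, "body": True}, weights)
```

With these assumed weights a document matching in title and body scores about 0.8, and a document matching in every zone scores 1.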

term frequency and weighting

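  • intuition: a document that mentions a query term more often should score higher, but are all words equally important?
  • free text query: each document is viewed as a set of term weights (bag of words model, word order is ignored)
    score = the sum, over the query terms, of each term's weight in the document
  • inverse document frequency: idf_t = log(N / df_t), where N is the number of documents and df_t is the number of documents containing term t
  • tf-idf weighting: tf-idf_{t,d} = tf_{t,d} × idf_t
    terms with lower document frequency get a higher weight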
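
The weighting above can be sketched as follows, assuming documents are already tokenized into lists of terms and that idf uses a base-10 logarithm. The function names tf_idf_weights and overlap_score and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf_weights(docs):
    """Return one dict per document mapping term -> tf * log10(N / df)."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log10(n / df[t]) for t in tf})
    return weights


def overlap_score(query_terms, doc_weights):
    """Score = sum of the tf-idf weights of the query terms in the document."""
    return sum(doc_weights.get(t, 0.0) for t in query_terms)


docs = [["car", "insurance", "car"], ["car", "auto"], ["best", "insurance"]]
weights = tf_idf_weights(docs)
scores = [overlap_score(["car", "auto"], w) for w in weights]
```

In this toy collection "auto" occurs in one document and "car" in two, so "auto" gets the larger weight in the second document. A term that occurs in every document would get weight 0.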

the vector space for scoring

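  • dot products: similarity between two documents
    why not the magnitude of the vector difference? Two documents with similar content can differ widely in length, so their difference can be large. Cosine similarity length-normalizes the vectors to remove the effect of document length.
[Figure: cosine similarity]
    sim(d1, d2) = V(d1) · V(d2) / (|V(d1)| |V(d2)|)
  • queries as vectors: treat the query as a very short document and rank documents by their cosine similarity to it
    computing the similarity against every document in the collection is expensive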
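
A small sketch of cosine similarity over sparse vectors, assuming each vector is a dict from term to weight. The name cosine_similarity and the example vectors are assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_u * norm_v)


d1 = {"car": 1.0, "insurance": 2.0}
d2 = {"car": 2.0, "insurance": 4.0}  # d1 repeated twice: same direction, longer
d3 = {"auto": 3.0}                   # shares no terms with d1
```

Here d2 is d1 appended to itself, so the vector difference is large while the cosine stays at 1 (up to rounding). That is the document length effect the normalization removes.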

  • computing vector scores

[Figure: basic algorithm for computing vector space scores]
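
The basic algorithm can be sketched term-at-a-time: keep one score accumulator per document, add each query term's contribution while walking its postings list, divide by the document length, and return the top k. The data layout (postings as lists of (doc_id, weight) pairs, doc_lengths as a dict) and the name cosine_score are assumptions for this sketch.

```python
import heapq
from collections import defaultdict


def cosine_score(query_weights, postings, doc_lengths, k):
    """Return the k highest-scoring (doc_id, score) pairs for the query.

    query_weights: dict term -> weight of the term in the query
    postings:      dict term -> list of (doc_id, weight of the term in the doc)
    doc_lengths:   dict doc_id -> Euclidean length of the document vector
    """
    scores = defaultdict(float)
    for term, w_tq in query_weights.items():
        for doc_id, w_td in postings.get(term, []):
            scores[doc_id] += w_td * w_tq  # accumulate the dot product
    for doc_id in scores:
        scores[doc_id] /= doc_lengths[doc_id]  # length normalization
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


postings = {
    "car": [(1, 2.0), (2, 1.0)],
    "insurance": [(1, 1.0), (3, 3.0)],
}
doc_lengths = {1: 2.0, 2: 1.0, 3: 4.0}
top = cosine_score({"car": 1.0, "insurance": 1.0}, postings, doc_lengths, k=2)
# top == [(1, 1.5), (2, 1.0)]
```

Only documents that appear in some postings list of a query term receive an accumulator, which keeps the cost far below comparing the query against every document vector.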

Variant tf–idf functions

[Figure: SMART notation for tf–idf variants]
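  • Pivoted normalized document length
    the relationship between document length and relevance: compared with the true probability of relevance, cosine normalization tends to favor short documents and penalize long ones; the normalization factor is rotated around a pivot length to correct this
[Figure: pivoted normalized document length]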
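
A hedged sketch of the pivoted normalization factor, assuming the common linear form with a slope below 1 that crosses the plain length at the pivot. The function name pivoted_norm and the pivot and slope values are illustrative assumptions; in practice both are tuned on a collection.

```python
def pivoted_norm(length, pivot, slope):
    """Normalization factor that equals `length` at the pivot.

    With slope < 1, documents shorter than the pivot get a larger factor
    than their own length (lower scores), and longer documents get a
    smaller one (higher scores).
    """
    return (1.0 - slope) * pivot + slope * length


pivot, slope = 10.0, 0.75  # hypothetical values
short_doc = pivoted_norm(4.0, pivot, slope)
at_pivot = pivoted_norm(10.0, pivot, slope)
long_doc = pivoted_norm(20.0, pivot, slope)
```

Dividing a document's score by this factor instead of its plain vector length shifts some weight from short documents to long ones.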

linear model
machine learning techniques
