IR chapter 4: Index construction


hardware basics

  • cache
    disk access is slow compared to memory
    so frequently accessed disk data are kept in main memory (caching)
  • seek time
    the time for the disk head to move to the part of the disk where the data are located
    about 5 ms on a typical disk, while transferring 10 MB takes about 0.2 s; storing data in contiguous chunks (one seek instead of many) therefore improves the effective transfer rate
  • buffer
    the part of memory where a block being read or written is stored
  • The processor is available to process data during disk I/O
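The seek-versus-transfer trade-off above can be checked with a little arithmetic. A minimal sketch, using the (assumed) numbers from the notes: ~5 ms per seek and ~0.2 s to transfer 10 MB:

```python
# Assumed hardware numbers from the notes: ~5 ms average seek time,
# and a transfer rate such that 10 MB takes ~0.2 s to read contiguously.
SEEK_TIME = 0.005                        # seconds per seek (assumed)
TRANSFER_PER_BYTE = 0.2 / (10 * 10**6)   # seconds per byte (assumed)

def read_time(total_bytes, n_chunks):
    """Time to read total_bytes stored in n_chunks separate locations:
    one seek per chunk plus the (fixed) transfer time for the data."""
    return n_chunks * SEEK_TIME + total_bytes * TRANSFER_PER_BYTE

print(read_time(10 * 10**6, 1))    # contiguous: 1 seek   -> 0.205 s
print(read_time(10 * 10**6, 100))  # fragmented: 100 seeks -> 0.7 s
```

With 100 scattered chunks, seeking dominates and the read takes more than three times as long, which is why BSBI stores its runs in contiguous blocks.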

blocked sort-based indexing

  • termID
    represent each term by a unique serial number, its termID
    the term-to-termID mapping can be built in a single pass through the collection

Reuters-RCV1 collection

  • the termID-docID pairs take 0.8 GB; for larger collections it is infeasible to sort them all in memory, so we turn to an external sorting algorithm
  • blocked sort-based indexing algorithm (BSBI):
    segments the collection into parts of equal size
    sorts the termID-docID pairs of each part in memory
    stores intermediate results on disk
    merges all intermediate results into the final index
[Figure 1: BSBI]
  • parse next block: parse documents into termID-docID pairs and accumulate the pairs in memory until the block is full
  • invert:
    sort the termID-docID pairs
    collect all termID-docID pairs with the same termID into a postings list
  • merge blocks
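The invert step above can be sketched in a few lines. A minimal sketch (the term-to-termID mapping and block parsing are assumed to happen elsewhere):

```python
from collections import defaultdict

def invert(pairs):
    """BSBI invert step: sort termID-docID pairs, then group all pairs
    with the same termID into one postings list (sorted by docID)."""
    index = defaultdict(list)
    for term_id, doc_id in sorted(pairs):
        # skip duplicate docIDs, which sorting places next to each other
        if not index[term_id] or index[term_id][-1] != doc_id:
            index[term_id].append(doc_id)
    return dict(index)

pairs = [(2, 7), (1, 3), (2, 1), (1, 3), (1, 9)]
print(invert(pairs))  # {1: [3, 9], 2: [1, 7]}
```

In the real algorithm this inverted block would then be written to disk before the next block is parsed.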
[Figure 2: merging blocks]

open all block files simultaneously, maintaining a read buffer for each block and a small write buffer for the final result. In each iteration, choose the lowest termID that has not been processed yet, read and merge all postings lists for that termID, and write the merged list back to disk.

  • time complexity is Θ(T log T), because the step with the highest time complexity is sorting, and T is an upper bound on the number of items we must sort (i.e., the number of termID–docID pairs)
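The merge step described above is a standard k-way merge, which the Python standard library can sketch directly (a toy version where each "block" is an in-memory list rather than a file on disk):

```python
import heapq

def merge_blocks(*blocks):
    """Merge sorted per-block streams of (termID, postings) into one index.
    heapq.merge lazily interleaves the streams in termID order, mimicking
    reading the lowest unprocessed termID from all open block files."""
    merged = []
    for term_id, postings in heapq.merge(*blocks, key=lambda x: x[0]):
        if merged and merged[-1][0] == term_id:
            # same termID from another block: concatenate its postings
            # (valid because BSBI blocks partition the docID range in order)
            merged[-1][1].extend(postings)
        else:
            merged.append((term_id, list(postings)))
    return merged

b1 = [(1, [2, 4]), (3, [1])]
b2 = [(1, [5]), (2, [6])]
print(merge_blocks(b1, b2))  # [(1, [2, 4, 5]), (2, [6]), (3, [1])]
```

In the real algorithm the inputs are read buffers over block files and the output goes through a small write buffer, but the control flow is the same.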

ex4.6

single pass in-memory indexing

  • for larger collections, the data structure mapping terms to termIDs does not fit into memory
  • use terms instead of termIDs: write each block's dictionary to disk, and then start a new dictionary for the next block
  • dynamic, efficient, and memory-saving


    [Figure 3: SPIMI-Invert]
  • The time complexity of SPIMI is Θ(T) because no sorting of tokens is required and all operations are at most linear in the size of the collection.
  • Some memory is wasted, because postings lists are allocated with room to grow.
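The SPIMI-Invert idea can be sketched as follows (a toy in-memory version: `max_terms` stands in for "memory is exhausted", and writing the block to disk is omitted):

```python
def spimi_invert(token_stream, max_terms=100000):
    """SPIMI-Invert sketch: the per-block dictionary is keyed by the term
    itself, so no global term-to-termID mapping is needed, and terms are
    sorted only once, when the finished block is written out."""
    dictionary = {}
    for term, doc_id in token_stream:
        postings = dictionary.setdefault(term, [])  # add term on first sight
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)                 # docIDs arrive in order
        if len(dictionary) >= max_terms:            # "memory full": stop block
            break
    # sort terms so blocks can later be merged with a simple linear scan
    return sorted(dictionary.items())

tokens = [("web", 1), ("search", 1), ("web", 2), ("engine", 2)]
print(spimi_invert(tokens))
# [('engine', [2]), ('search', [1]), ('web', [1, 2])]
```

Note the contrast with BSBI: postings accumulate directly under each term as tokens stream by, so there is no Θ(T log T) sort of all termID-docID pairs.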

distributed indexing

  • used by Web search engines for very large collections
  • this construction produces a term-partitioned index (though most search engines prefer a document-partitioned index in practice, for better load balancing)
  • MapReduce: a general architecture for distributed computing that solves large computing problems on cheap commodity machines


    [Figure 4: MapReduce]
  • master node: controls the splits, assigning and reassigning tasks to machines
  • key-value pairs: (termID, docID)
  • map phase: parsers, which share a consistent term-to-termID mapping, partition the keys and write each partition to a segment file on their local disk
  • reduce phase: inverters each read their r segment files (one sequential read per file, minimizing disk seeks and the amount of network traffic)
  • parsers and inverters are not separate sets of machines
  • this was the architecture of Google's 2004 indexing system
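The map and reduce phases above can be sketched as plain functions. A toy, single-machine version (real MapReduce distributes the splits and key partitions across machines, and keys would be termIDs rather than raw terms):

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Parser: emit (term, docID) key-value pairs for one document split."""
    return [(term, doc_id) for term in text.split()]

def reduce_phase(pairs):
    """Inverter: collect all values for the same key into one postings list."""
    postings = defaultdict(set)
    for term, doc_id in pairs:
        postings[term].add(doc_id)
    return {term: sorted(docs) for term, docs in sorted(postings.items())}

pairs = map_phase(1, "web search") + map_phase(2, "web engine")
print(reduce_phase(pairs))  # {'engine': [2], 'search': [1], 'web': [1, 2]}
```

The framework's contribution is everything this sketch hides: shipping the right key partitions to the right inverters and rerunning tasks from failed machines.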

dynamic indexing

  • as documents come and go, the dictionary and postings lists need to be modified, created, or deleted
  • periodic reconstruction from scratch costs space and delays the searchability of new documents
  • instead, maintain two indexes: the large main index on disk and a small auxiliary index kept in memory, merging them when the auxiliary index grows too large
  • a one-file-per-postings-list scheme would make merging cheap, but is not feasible (too many files for the file system)
  • storing the index in one file means each merge rewrites the whole index, for a total construction time of Θ(T²/n), where T is the number of postings and n the size of the auxiliary index
  • logarithmic merging reduces this to Θ(T log(T/n))
[Figure 5: logarithmic merging]
  • but query processing now requires merging results from O(log(T/n)) indexes, instead of just the main and auxiliary index
  • positional indexes can be constructed analogously; the postings are just larger, so there is more data to sort and merge
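Logarithmic merging works like binary addition: level i holds an index of roughly 2^i · n postings, and flushing a full in-memory index cascades merges up the levels like a carry. A toy sketch, where an "index" is just a sorted list of postings and `merge` is an assumed merge function:

```python
def logarithmic_add(levels, new_index, merge):
    """Carry-propagation step of logarithmic merging: levels[i] holds an
    on-disk index of size ~ 2**i * n, or None if the slot is empty.
    Each posting participates in O(log(T/n)) merges overall."""
    i = 0
    while i < len(levels) and levels[i] is not None:
        new_index = merge(levels[i], new_index)  # carry: merge and move up
        levels[i] = None
        i += 1
    if i == len(levels):
        levels.append(None)                      # grow a new, larger level
    levels[i] = new_index
    return levels

# toy "indexes": sorted lists of postings; merge = sorted union
merge = lambda a, b: sorted(set(a) | set(b))
levels = []
for block in ([1, 2], [3], [4, 5], [6]):
    levels = logarithmic_add(levels, block, merge)
print(levels)  # [None, None, [1, 2, 3, 4, 5, 6]]
```

After four flushes, everything has cascaded into one index at level 2, and the lower slots are empty again, exactly like the bits of the binary counter 100.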
