IR chapter 4: Index construction


hardware basics

  • cache
    disk access is slow compared to memory
    so frequently accessed disk data are kept in main memory (caching)
  • seek time
    the time for the disk head to move to the part of the disk where the data are located
    about 5 ms on a typical disk, while transferring 10 MB takes about 0.2 s; storing data in contiguous chunks (one seek instead of many) therefore improves the effective transfer rate
  • buffer
    the part of memory where a block being read or written is stored
  • The processor is available to process data during disk I/O
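The seek-versus-transfer trade-off above can be checked with a little arithmetic. A minimal sketch, using the (assumed) numbers from the notes: ~5 ms per seek and ~0.2 s to transfer 10 MB:

```python
# Assumed hardware numbers from the notes: ~5 ms average seek time,
# and a transfer rate such that 10 MB takes ~0.2 s to read contiguously.
SEEK_TIME = 0.005                        # seconds per seek (assumed)
TRANSFER_PER_BYTE = 0.2 / (10 * 10**6)   # seconds per byte (assumed)

def read_time(total_bytes, n_chunks):
    """Time to read total_bytes stored in n_chunks separate locations:
    one seek per chunk plus the (fixed) transfer time for the data."""
    return n_chunks * SEEK_TIME + total_bytes * TRANSFER_PER_BYTE

print(read_time(10 * 10**6, 1))    # contiguous: 1 seek   -> 0.205 s
print(read_time(10 * 10**6, 100))  # fragmented: 100 seeks -> 0.7 s
```

With 100 scattered chunks, seeking dominates and the read takes more than three times as long, which is why BSBI stores its runs in contiguous blocks.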

blocked sort-based indexing

  • termID
    represent each term by a unique serial number, its termID
    the term-to-termID mapping can be built in a single pass through the collection

Reuters-RCV1 collection

  • the termID-docID pairs take 0.8 GB; for larger collections it is infeasible to sort them all in memory, so we turn to an external sorting algorithm
  • blocked sort-based indexing algorithm (BSBI):
    segments the collection into parts of equal size
    sorts the termID-docID pairs of each part in memory
    stores intermediate results on disk
    merges all intermediate results into the final index
[Figure 1: BSBI]
  • parse next block: parse documents into termID-docID pairs and accumulate the pairs in memory until the block is full
  • invert:
    sort the termID-docID pairs
    collect all termID-docID pairs with the same termID into a postings list
  • merge blocks
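The invert step above can be sketched in a few lines. A minimal sketch (the term-to-termID mapping and block parsing are assumed to happen elsewhere):

```python
from collections import defaultdict

def invert(pairs):
    """BSBI invert step: sort termID-docID pairs, then group all pairs
    with the same termID into one postings list (sorted by docID)."""
    index = defaultdict(list)
    for term_id, doc_id in sorted(pairs):
        # skip duplicate docIDs, which sorting places next to each other
        if not index[term_id] or index[term_id][-1] != doc_id:
            index[term_id].append(doc_id)
    return dict(index)

pairs = [(2, 7), (1, 3), (2, 1), (1, 3), (1, 9)]
print(invert(pairs))  # {1: [3, 9], 2: [1, 7]}
```

In the real algorithm this inverted block would then be written to disk before the next block is parsed.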
[Figure 2: merging blocks]

open all block files simultaneously, maintaining a read buffer for each block and a small write buffer for the final result. In each iteration, choose the lowest termID that has not been processed yet, read and merge all postings lists for that termID, and write the merged list back to disk.

  • time complexity is Θ(T log T), because the step with the highest time complexity is sorting, and T is an upper bound on the number of items we must sort (i.e., the number of termID–docID pairs)
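The merge step described above is a standard k-way merge, which the Python standard library can sketch directly (a toy version where each "block" is an in-memory list rather than a file on disk):

```python
import heapq

def merge_blocks(*blocks):
    """Merge sorted per-block streams of (termID, postings) into one index.
    heapq.merge lazily interleaves the streams in termID order, mimicking
    reading the lowest unprocessed termID from all open block files."""
    merged = []
    for term_id, postings in heapq.merge(*blocks, key=lambda x: x[0]):
        if merged and merged[-1][0] == term_id:
            # same termID from another block: concatenate its postings
            # (valid because BSBI blocks partition the docID range in order)
            merged[-1][1].extend(postings)
        else:
            merged.append((term_id, list(postings)))
    return merged

b1 = [(1, [2, 4]), (3, [1])]
b2 = [(1, [5]), (2, [6])]
print(merge_blocks(b1, b2))  # [(1, [2, 4, 5]), (2, [6]), (3, [1])]
```

In the real algorithm the inputs are read buffers over block files and the output goes through a small write buffer, but the control flow is the same.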

ex4.6

single pass in-memory indexing

  • for larger collections, the data structure mapping terms to termIDs does not fit into memory
  • use terms instead of termIDs: write each block's dictionary to disk, and then start a new dictionary for the next block
  • dynamic, efficient, and memory-saving


    [Figure 3: SPIMI-Invert]
  • The time complexity of SPIMI is Θ(T) because no sorting of tokens is required and all operations are at most linear in the size of the collection.
  • Some memory is wasted, because postings lists are allocated with room to grow.
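The SPIMI-Invert idea can be sketched as follows (a toy in-memory version: `max_terms` stands in for "memory is exhausted", and writing the block to disk is omitted):

```python
def spimi_invert(token_stream, max_terms=100000):
    """SPIMI-Invert sketch: the per-block dictionary is keyed by the term
    itself, so no global term-to-termID mapping is needed, and terms are
    sorted only once, when the finished block is written out."""
    dictionary = {}
    for term, doc_id in token_stream:
        postings = dictionary.setdefault(term, [])  # add term on first sight
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)                 # docIDs arrive in order
        if len(dictionary) >= max_terms:            # "memory full": stop block
            break
    # sort terms so blocks can later be merged with a simple linear scan
    return sorted(dictionary.items())

tokens = [("web", 1), ("search", 1), ("web", 2), ("engine", 2)]
print(spimi_invert(tokens))
# [('engine', [2]), ('search', [1]), ('web', [1, 2])]
```

Note the contrast with BSBI: postings accumulate directly under each term as tokens stream by, so there is no Θ(T log T) sort of all termID-docID pairs.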

distributed indexing

  • used by Web search engines for very large collections
  • this construction produces a term-partitioned index (though most search engines prefer a document-partitioned index in practice, for better load balancing)
  • MapReduce: a general architecture for distributed computing that solves large computing problems on cheap commodity machines


    [Figure 4: MapReduce]
  • master node: controls the splits, assigning and reassigning tasks to machines
  • key-value pairs: (termID, docID)
  • map phase: parsers, which share a consistent term-to-termID mapping, partition the keys and write each partition to a segment file on their local disk
  • reduce phase: inverters each read their r segment files (one sequential read per file, minimizing disk seeks and the amount of network traffic)
  • parsers and inverters are not separate sets of machines
  • this was the architecture of Google's 2004 indexing system
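The map and reduce phases above can be sketched as plain functions. A toy, single-machine version (real MapReduce distributes the splits and key partitions across machines, and keys would be termIDs rather than raw terms):

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Parser: emit (term, docID) key-value pairs for one document split."""
    return [(term, doc_id) for term in text.split()]

def reduce_phase(pairs):
    """Inverter: collect all values for the same key into one postings list."""
    postings = defaultdict(set)
    for term, doc_id in pairs:
        postings[term].add(doc_id)
    return {term: sorted(docs) for term, docs in sorted(postings.items())}

pairs = map_phase(1, "web search") + map_phase(2, "web engine")
print(reduce_phase(pairs))  # {'engine': [2], 'search': [1], 'web': [1, 2]}
```

The framework's contribution is everything this sketch hides: shipping the right key partitions to the right inverters and rerunning tasks from failed machines.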

dynamic indexing

  • as documents come and go, the dictionary and postings lists need to be modified, created, or deleted
  • periodic reconstruction from scratch costs space and delays the searchability of new documents
  • instead, maintain two indexes: the large main index on disk and a small auxiliary index kept in memory, merging them when the auxiliary index grows too large
  • a one-file-per-postings-list scheme would make merging cheap, but is not feasible (too many files for the file system)
  • storing the index in one file means each merge rewrites the whole index, for a total construction time of Θ(T²/n), where T is the number of postings and n the size of the auxiliary index
  • logarithmic merging reduces this to Θ(T log(T/n))
[Figure 5: logarithmic merging]
  • but query processing now requires merging results from O(log(T/n)) indexes, instead of just the main and auxiliary index
  • positional indexes can be constructed analogously; the postings are just larger, so there is more data to sort and merge
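Logarithmic merging works like binary addition: level i holds an index of roughly 2^i · n postings, and flushing a full in-memory index cascades merges up the levels like a carry. A toy sketch, where an "index" is just a sorted list of postings and `merge` is an assumed merge function:

```python
def logarithmic_add(levels, new_index, merge):
    """Carry-propagation step of logarithmic merging: levels[i] holds an
    on-disk index of size ~ 2**i * n, or None if the slot is empty.
    Each posting participates in O(log(T/n)) merges overall."""
    i = 0
    while i < len(levels) and levels[i] is not None:
        new_index = merge(levels[i], new_index)  # carry: merge and move up
        levels[i] = None
        i += 1
    if i == len(levels):
        levels.append(None)                      # grow a new, larger level
    levels[i] = new_index
    return levels

# toy "indexes": sorted lists of postings; merge = sorted union
merge = lambda a, b: sorted(set(a) | set(b))
levels = []
for block in ([1, 2], [3], [4, 5], [6]):
    levels = logarithmic_add(levels, block, merge)
print(levels)  # [None, None, [1, 2, 3, 4, 5, 6]]
```

After four flushes, everything has cascaded into one index at level 2, and the lower slots are empty again, exactly like the bits of the binary counter 100.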
