Efficient in-memory extensible inverted file

Abstract

The growing amount of on-line data demands efficient parallel and distributed indexing mechanisms to manage large resource requirements and unpredictable system failures. Parallel and distributed indices built on commodity hardware such as personal computers (PCs) can substantially reduce cost because PCs are produced in bulk, achieving economies of scale. However, PCs have a limited amount of random access memory (RAM), so effective utilization of RAM for in-memory inversion is crucial. This paper presents an analytical investigation and an empirical evaluation of storage-efficient in-memory extensible inverted files, which are represented by fixed- or variable-sized linked list nodes. The size of these linked list nodes is determined by minimizing storage waste or maximizing storage utilization under different conditions, leading to different storage allocation schemes. Minimizing storage waste also reduces the number of address indirections (i.e., chainings). We evaluated our storage allocation schemes using a number of reference collections and found that the arrival rate scheme is the best in terms of both storage utilization and the mean number of chainings per term. The final storage utilization can be over 90% in our evaluation if a sufficient number of documents is indexed, and the mean number of chainings is small (less than 2.6 for all the reference collections). We have also shown that our best storage allocation scheme can be used for our extensible compressed inverted file, whose final storage utilization can likewise be over 90% in our evaluation provided that a sufficient number of documents is indexed. The proposed storage allocation schemes can also be used by compressed extensible inverted files with word positions.

Keywords: Information retrieval; Indexing; Optimization



1. Introduction

As more and more data are made available on-line, it becomes increasingly difficult to manage a single inverted file. This difficulty arises from the substantial resource requirements of large-scale indexing and from the long indexing time, which makes the system vulnerable to unpredictable failures. For example, the very large collection (VLC) from TREC [1] requires 100GB of storage and the TREC terabyte track requires 426GB [2]. The WebBase repository [3] requires 220GB, estimated to be only 4% of the indexable web pages. The volume of well-written, non-English content is also increasing; in the near future, Japanese patent data from NTCIR [4] may be as large as 160GB. One way to manage such large quantities of data is to create the index by merging smaller indices, which are built using multiple machines indexing different document subsets in parallel [5]. This limits the impact of system failures to individual machines and increases indexing speed.

Acquiring computing machines in bulk as commodity hardware substantially reduces monetary costs. Commodity hardware such as personal computers (PCs) also makes in-memory inversion an attractive proposition because random access memory (RAM) for the PC market is relatively cheap and fast, and because RAM can potentially be upgraded later at lower prices (e.g., DDR-300 RAM to DDR-400 RAM). However, PCs can hold only a relatively small amount of RAM (e.g., 4GB) compared with mainframe computers. Efficient RAM utilization is therefore an important issue for in-memory inversion using a large number of PCs, because typically the entire inverted index cannot be stored in RAM due to the large volume of on-line data. Instead, the inverted file is typically built in relatively small batches [6] and [7]. For each batch, a partial index is built and held in RAM, then written out as a run on disk. The runs are finally merged into the final inverted file.

Efficient RAM utilization can reduce indexing time because it enables more documents to be indexed per run, reducing the number of inverted files to merge. During updates, temporary indices are maintained in memory and then integrated into the main inverted file in batches. Lester and Zobel [8] showed that, for different inverted file maintenance methods (i.e., re-build, re-merge and in-place methods), the amortized time cost of integrating the temporary index with the main inverted file is reduced when more documents are indexed. Therefore, during both initial construction and update, making better use of memory resources can reduce overall costs. Since indexing is memory-intensive whereas loading and flushing data are disk- or network-intensive, efficient in-memory inversion also has the potential to better balance system resource utilization, making it crucial to index construction.

The major contribution of this paper is to enhance existing, simple-to-implement, single-pass in-memory inversion so that it is storage-efficient for creating partial inverted files and/or temporary indices, by developing novel storage allocation schemes that predict the needed storage with minimal waste. The partial index created by our in-memory inversion can be merged with the main inverted file or with other partial inverted files. The temporary index created by our in-memory inversion can also be searched while it is being built. This reduces the latency with which recently indexed documents become searchable, which is important for certain applications (e.g., searching recently published news articles).

An evaluation was carried out to determine which of our storage allocation schemes is the best and whether the results are comparable to existing methods (Section 5). The evaluation used 3.5GB of test data from the VLC. The best allocation scheme was the arrival rate scheme, which achieved 95% final storage utilization for this VLC dataset. To ascertain the generality of the results, various independent datasets for both English (TREC-2, TREC-6 and TREC-2005) and Chinese (NTCIR-5) were also used to evaluate the best storage allocation scheme. We also showed that the indexing speed of our best storage allocation scheme is similar to the indexing speeds reported by others [6] and [9].

The rest of this paper is organized as follows. Section 2 describes our extensible inverted file structures, the modifications needed to incorporate compressed postings and word positions, and the associated storage wastes. This section also provides the rationale behind the choice of data structure for our storage allocation schemes and behind the need to optimize storage waste. Section 3 describes the first approach to determining optimal node sizes, which uses a stepwise optimization strategy and results in three related storage allocation schemes. Section 4 discusses the second approach, which determines the optimal node size that minimizes the asymptotic worst-case storage waste per period for individual terms. Section 5 evaluates these storage allocation schemes and discusses the best scheme in terms of storage utilization, the mean number of chainings, robustness of performance and indexing speed. This section also shows that our storage allocation schemes can be used to allocate nodes for compressed postings, using the best storage allocation scheme and variable byte compression as an example. Section 6 discusses related work and describes how the storage allocation schemes apply to our extended inverted file that incorporates compressed postings and word positions. Section 7 concludes.

2. Extensible inverted file and storage wastes

This section describes the structure of the extensible inverted file and the related considerations for handling compressed postings and word position information. It also discusses the storage wastes of the extensible inverted file and the rationale for optimizing them, as well as the rationale for using the variable-size linked list data structure.

2.1. Extensible inverted file

An in-memory inverted file can be considered as a set of inverted lists, which are implemented as linked lists. Each linked list node holds a set of (basic) postings of the form ⟨di, tf(di,tk)⟩, where each basic posting consists of a document identifier di and the within-document term frequency tf(di,tk) of the kth term in the ith document. The rest of this paper assumes that, unless otherwise indicated, all postings are basic postings. If a linked list node can hold a variable number of postings, then two additional fields of information are stored in each node besides the postings: a node size variable and an extension pointer. The node size variable specifies the amount of storage allocated for the current node, and the extension pointer facilitates the chaining of nodes.

Fig. 1 shows the conceptual structure of the extensible inverted file, implemented using a set of variable-size linked list nodes. The dictionary data structure holds the set of index terms and the start address of the corresponding variable-size linked list. A new node is allocated whenever the linked list of the kth index term tk is full (e.g., the linked list of the index term “Network” in Fig. 1) and a new posting for tk arrives. The size of the new node is determined using one of the storage allocation schemes discussed in the next two sections. If the linked list nodes hold a fixed number of postings per node, then the node size variable can be discarded, saving storage space.
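As a concrete illustration, the following C sketch shows one plausible layout for these nodes; the 4-byte document identifier, 2-byte term frequency and flexible array payload are illustrative assumptions, not necessarily the exact layout used in our implementation.

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct {             /* basic posting <di, tf(di,tk)> */
    uint32_t doc_id;         /* document identifier di */
    uint16_t tf;             /* within-document term frequency */
} Posting;

typedef struct Node {
    uint16_t     size;       /* node size variable: capacity in postings */
    struct Node *next;       /* extension pointer for chaining */
    Posting      postings[]; /* variable-size payload */
} Node;

/* Allocate a new node whose capacity is chosen by one of the
   storage allocation schemes of Sections 3 and 4. */
Node *node_alloc(uint16_t capacity)
{
    Node *n = malloc(sizeof(Node) + capacity * sizeof(Posting));
    if (n != NULL) {
        n->size = capacity;
        n->next = NULL;
    }
    return n;
}
```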



Fig. 1. The conceptual structure of our extensible inverted file, represented as variable-sized nodes. The start and last pointers point, respectively, to the first and last linked list nodes of the inverted list.

Each dictionary entry for an index term has a start pointer and a last pointer, which point, respectively, to the first and last linked list nodes of the inverted list. The last pointer avoids traversing the linked list when a new posting for the index term is inserted. During insertion, the last linked list node needs to be exclusively locked to maintain the integrity of the data under concurrent access [10]. To reduce memory usage, the start pointers can be stored in a file, since start pointers are used only for retrieval and not for inserting new postings. For clarity of presentation, additional information that each dictionary entry may contain (e.g., document frequency) is not shown in Fig. 1. In particular, each dictionary entry should hold a variable, say mpos, which indicates the position of the unfilled portion of the last node, to improve posting insertion speed.
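Continuing the sketch above, a dictionary entry and the corresponding insertion routine might look as follows; new_node_size() is a hypothetical placeholder for whichever storage allocation scheme of Sections 3 and 4 is in use.

```c
typedef struct {
    char    *term;   /* the index term tk */
    Node    *start;  /* first node of the inverted list (retrieval only) */
    Node    *last;   /* last node, where new postings are inserted */
    uint16_t mpos;   /* position of the unfilled portion of the last node */
} DictEntry;

uint16_t new_node_size(const DictEntry *e); /* hypothetical: an allocation scheme */

void insert_posting(DictEntry *e, uint32_t doc_id, uint16_t tf)
{
    if (e->last == NULL || e->mpos == e->last->size) { /* last node is full */
        Node *n = node_alloc(new_node_size(e));        /* chain a new node */
        if (e->last != NULL)
            e->last->next = n;
        else
            e->start = n;
        e->last = n;
        e->mpos = 0;
    }
    e->last->postings[e->mpos].doc_id = doc_id;        /* insert at mpos */
    e->last->postings[e->mpos].tf = tf;
    e->mpos++;
}
```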

The extensible inverted file can support a special type of posting for block addressing inverted files [9] and [11], which index fixed-size text blocks instead of variable-size documents. This special type of posting, called a block-address posting in this paper, has only the di field, without the term frequency tf(di,tk) field of the basic posting, where di is the block identifier instead of the document identifier and tk is the kth term. Our storage waste optimization, discussed in Sections 3 and 4, can minimize the storage wastes of nodes that store either basic postings or block-address postings because the storage for each such posting is a constant (i.e., c1) in our storage waste optimization.

The extensible inverted file can support storage of compressed postings [12], as well as word positions. For compressed postings (e.g., γ [13] or variable byte compression [14]), each dictionary entry keeps track of the bit position (again using mpos) or the byte position of the unfilled portion of the last node. The new compressed posting is inserted at mpos as a bit/byte string. If the last node does not have enough free space for a new compressed posting, then the new compressed posting is split between the last node and a newly allocated node.
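The sketch below, continuing in the same style, illustrates the byte-oriented case: a value (e.g., a document identifier gap) is variable-byte encoded and split across the node boundary when the last node runs out of space. The ByteNode layout and the encoding convention (7 data bits per byte, high bit marking the final byte) are assumptions for illustration.

```c
typedef struct ByteNode {
    uint16_t         size;   /* capacity in bytes */
    struct ByteNode *next;   /* extension pointer */
    uint8_t          bytes[];
} ByteNode;

/* Variable byte encoding: 7 data bits per byte; the high bit marks
   the final byte. Returns the number of bytes written (at most 5). */
static int vbyte_encode(uint32_t v, uint8_t *out)
{
    int n = 0;
    while (v >= 128) {
        out[n++] = (uint8_t)(v & 127);
        v >>= 7;
    }
    out[n++] = (uint8_t)(v | 128);
    return n;
}

/* Append one compressed value at byte position *mpos of the last node,
   splitting it into a freshly allocated node when the last node fills. */
void append_compressed(ByteNode **last, uint16_t *mpos,
                       uint32_t value, uint16_t next_node_size)
{
    uint8_t buf[5];
    int len = vbyte_encode(value, buf);
    for (int i = 0; i < len; i++) {
        if (*mpos == (*last)->size) {
            ByteNode *n = malloc(sizeof(ByteNode) + next_node_size);
            n->size = next_node_size;
            n->next = NULL;
            (*last)->next = n;
            *last = n;
            *mpos = 0;
        }
        (*last)->bytes[(*mpos)++] = buf[i];
    }
}
```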

There are two general approaches to storing postings with word positions. The first approach stores the word positions in the nodes; one way to do this (as in [6]) is to store the posting followed by a sequence of word positions, i.e., ⟨di, tf(di,tk)⟩.⟨pos(di,tk,1), …, pos(di,tk,tf(di,tk))⟩, where di is the identifier of the ith document, tf(di,tk) is the within-document frequency of the kth term in the ith document, and pos(di,tk,x) is the xth word position of the kth term in the ith document. In this case, the node size includes the storage to hold the word positions as well as the postings. The second approach stores extended postings of the form ⟨di, tf(di,tk), f(di,tk)⟩, where f(di,tk) is the position in an auxiliary file that stores the sequence of word positions of the kth term in the ith document. In this approach, the word positions are stored in the auxiliary file. Whenever a new extended posting is added, the last position of the auxiliary file is stored as f(di,tk) of this new extended posting, and the word positions of the term associated with the new extended posting are appended sequentially to the auxiliary file. These word positions in the auxiliary file can be compressed, for example, using differential coding [9], [12], [13], [14], [15] and [16]. If the within-document term frequency is one (i.e., tf(di,tk)=1), then f(di,tk) of the extended posting can directly store the single word position of the kth term in the ith document, saving both storage and access time. For both approaches, the storage allocation schemes can be modified to determine the node sizes, as discussed in Section 5.
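A minimal sketch of the second (auxiliary file) approach, with fixed-width fields assumed for illustration and reusing the headers of the earlier sketches:

```c
#include <stdio.h>

typedef struct {
    uint32_t doc_id;  /* di */
    uint16_t tf;      /* tf(di,tk) */
    uint32_t fpos;    /* f(di,tk): offset into the auxiliary file, or the
                         single word position itself when tf == 1 */
} ExtPosting;

/* Build an extended posting, appending the word positions to the
   auxiliary file unless there is only one (stored in-line in fpos). */
ExtPosting make_ext_posting(FILE *aux, uint32_t doc_id, uint16_t tf,
                            const uint32_t *positions)
{
    ExtPosting p = { doc_id, tf, 0 };
    if (tf == 1) {
        p.fpos = positions[0];
    } else {
        p.fpos = (uint32_t)ftell(aux);                 /* start of sequence */
        fwrite(positions, sizeof(uint32_t), tf, aux);  /* append positions */
    }
    return p;
}
```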

2.2. Rationale for variable-size linked lists

The variable-size linked-list data structure is chosen here because it is used in many in-memory inverted files (the nodes are sometimes called buckets or fixed list blocks) and because it is relatively simple to implement and to analyze. Instead of linked lists, postings can be stored in RAM blocks that expand to hold more postings as they are inserted; this type of RAM block expansion may involve copying and moving data chunks. Our work can be considered as extending the RAM block approach: each block is pre-allocated with storage to hold the expected number of postings, so that it is not necessary to copy or move data chunks in the RAM blocks. This pre-allocation avoids memory fragmentation, and the difficulty shifts to predicting the expected number of postings for each term instead of relying on advanced storage allocators. If a fast storage allocator is used so that the allocation time is amortized to a constant, then storage utilization may be sacrificed. Instead of using advanced storage allocators, dynamic data structures like hash tables, skip lists and balanced trees can be used, which support deletion as well as insertion of postings. However, the storage utilizations of these dynamic data structures are typically low (i.e., no more than 60% if each node contains at least one 4-byte pointer and one 6-byte posting). It is possible to store multiple postings per node in these data structures, but then the problem of optimizing the storage waste per node re-appears, whether one is dealing with a dynamic data structure (e.g., balanced trees) or a variable-size linked list. Therefore, we propose to use variable-size linked lists in this study because they are simple to program, use simple and fast storage allocators, are commonly used by in-memory inverted files, can easily be adapted to store compressed postings, and can be optimized for storage waste in the same way as other dynamic data structures (e.g., balanced trees).

Our choice of linked lists to store (compressed) postings implies that our extensible inverted files are designed largely for append-only operations, where direct deletions and direct modifications can be avoided. Deletions can be done indirectly by filtering out document identifiers that are known to be invalid (or deleted) using a registry of stale documents [8] and [17], because search engines can trade off data consistency against availability [17] according to the CAP theorem [18] and [19]. Few deletions or modifications are expected during in-memory inversion because the incoming documents are relatively fresh; deletions and modifications occur more often after the inverted file has been transferred to disk. Since disk storage cost per byte is much cheaper than RAM, deletion by filtering document identifiers is a practical solution for large-scale search engines. Similarly, a modification can be implemented effectively as a deletion (implemented as filtering) followed by a document insertion. When the registry of stale document identifiers becomes large or when the temporary index is full, the main inverted file on disk can be maintained by the re-building, re-merging or in-place updating approaches [8]. Therefore, the choice of an append-only data structure, like the variable-size linked list, may not be a severe handicap.

The use of the variable-size linked list representation of inverted lists requires some consideration of how to merge partial inverted files. The first approach saves the in-memory inverted lists, represented as linked lists, as contiguous lists on disk. This requires the system to follow the extension pointers of the linked list nodes when transferring the in-memory inverted file to disk. Tracing all the extension pointers incurs some additional time due to cache misses, but most of the time cost is due to transferring data to disk, provided that the mean number of chainings per term is not large. Once the in-memory inverted file is transferred to disk as a set of contiguous lists, the conventional inverted file merging algorithm can be used to merge these partial inverted files on disk. An alternative approach dumps the in-memory inverted file onto disk as it is. During the first level of partial inverted file merging, the merging algorithm combines two inverted lists on disk, represented by two sets of linked lists, into a single contiguous inverted list on disk. The merged partial inverted file then has a set of contiguous inverted lists and can subsequently be merged with other merged partial inverted files using the conventional merging algorithm. However, following extension pointers on disk requires file seek operations, which incur more time than cache misses. Therefore, we prefer the first approach, because the time cost of a cache miss is less than that of a file seek and because this approach is applicable to the inverted file maintenance methods described by Lester and Zobel [8] (i.e., re-build, re-merge and in-place updates) using an in-memory temporary index (called the document buffer in [8]).
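Continuing the earlier sketches, the first approach amounts to flattening each chain while writing it out, as in the following illustrative routine:

```c
/* Write one in-memory inverted list to disk as a contiguous list by
   following the extension pointers; only the last node may be partial. */
void flush_inverted_list(FILE *out, const DictEntry *e)
{
    for (const Node *n = e->start; n != NULL; n = n->next) {
        uint16_t filled = (n == e->last) ? e->mpos : n->size;
        fwrite(n->postings, sizeof(Posting), filled, out);
    }
}
```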

2.3. Rationale for storage waste optimization

The success of representing inverted lists by linked lists rests on the ability to accurately predict the required storage so that the RAM storage utilization is maximized. Otherwise, if the final storage utilization is low (say 60%), other data structures that can support deletion should be used instead. The storage utilization U of the extensible inverted file is the ratio of the total storage P of all (compressed) postings to the total allocated storage (i.e., P+S, where S is the total storage waste). Maximization of the storage utilization U can be considered as the minimization of the storage waste of the extensible inverted file as follows:

    max U = max [P/(P+S)]  ⟺  min S

since P is treated as a constant which is fixed for a particular data collection.
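For example, if P = 9MB of postings occupy 10MB of allocated nodes (so S = 1MB), then U = 9/10 = 90%.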

Storage wastes in the extensible inverted file can be divided into two types:

(a) The first type of storage waste is the storage overhead ε that includes the storage for extension pointers and for the node size variables; and

(b) The second type of storage waste is the latent free storage, which has been allocated but is not yet filled with posting information. If this latent free storage were not counted as storage waste, then the optimal linked-list node size would be as large as possible, so that the overhead appears minimal relative to the total storage allocated to that node.

The storage waste of each node in the extensible inverted file is the sum of these two types of storage wastes of the node.

There are several advantages to optimizing storage waste. First, it maximizes storage utilization, which can reduce the number of inverted file merge operations and the amortized time cost of inverted file maintenance [8]. Second, it can indirectly reduce the number of chainings per term. This reduces (a) the time to search the temporary index when it is built on the fly, (b) the time to store the in-memory inverted lists as contiguous lists on disk when the partial inverted file is transferred to disk, and (c) the number of file seeks when merging two partial inverted files on disk if these files represent inverted lists as linked lists. Third, the analysis of optimizing storage waste applies not just to linked lists but also to other dynamic data structures (e.g., balanced trees) in which each node holds more than one (compressed) posting. In this case, the optimization analysis treats the per-node storage overhead of these dynamic structures as a constant ε with a different value.

3. Stepwise storage allocation approach

The stepwise storage allocation approach determines the optimal node size for the incoming documents based on statistics of the current set of indexed documents. This approach optimizes the expected worst-case storage waste E(W(S(N))) after N documents are indexed as follows:


    E(W(S(N))) = (1/N) Σ_{n=1}^{N} W(S(n))    (1)
where E(.) is the expectation operator, W(.) returns the worst-case function of its argument, and S(n) is the storage waste after indexing n documents. The reason for optimizing the expected storage waste is to minimize the area under the curve of storage waste against the number of documents indexed, so that the storage waste is kept small across the different numbers of documents indexed (up to N documents). This approach assumes that the optimal node size for N documents is close to the optimal node size for N+ΔN documents, where ΔN is a small quantity compared with N; this assumption holds when N is sufficiently large. It also assumes that the measured system parameters used to determine the optimal node size are smooth, without large discontinuities. Otherwise, parameters (e.g., the size of the vocabulary) obtained after indexing N documents may vary substantially, leading to drastic changes in the optimal node size and implying that the optimal node size cannot be predicted from past statistics.

This approach yields three related storage allocation schemes. The first determines the optimal node size after indexing N documents, which is the same as the optimal node size for a static collection of N documents; this scheme is called the fixed-sized node scheme (F16). The second scheme, called the vocabulary growth rate (VGR) scheme, extends the formula of the F16 scheme by determining the optimal node size from the parameter values extracted at the time when a new node is allocated. The assumption is that this optimal node size remains more or less the same between the time that the node is allocated and the time that the node is filled (i.e., the system behavior should be smooth). Unfortunately, the VGR scheme allocates the same optimal node size at a given time instance for common and non-common terms, which are known to have widely different numbers of postings and different desirable node sizes. Thus, the final storage allocation scheme, called the term growth rate (TGR) scheme, determines the optimal node size for individual terms. The first two schemes optimize the expected worst-case storage waste E(W(S(N))) over all terms (as in Eq. (1)). The TGR scheme optimizes the expected worst-case storage waste E(W(S(N, tk))) for the kth term after indexing N documents, where S(n, tk) is the storage waste of the kth term after indexing n documents. Similar to Eq. (1), the quantity E(W(S(N, tk))) is defined as follows:


    E(W(S(N, tk))) = (1/N) Σ_{n=1}^{N} W(S(n, tk))    (2)

3.1. Fixed-sized node scheme (F16)

The fixed-sized node storage allocation scheme allocates storage to hold B postings for each new node. The overhead εp of a node allocated by this scheme is the storage for the extension pointer. Assuming that each posting occupies c1 bytes, a node requires c1·B+εp bytes. The storage waste S(n, tk) for term tk after indexing n documents is the latent free storage of the last node plus the storage overhead of all the chained nodes. The latent free storage of the last node is c1(⌈df(n,tk)/B⌉·B − df(n,tk)), where df(n,tk) is the number of documents that contain the kth term and ⌈.⌉ is the ceiling function. The storage overhead of all the chained nodes (including the last, unfilled node) is due to the extension pointers and is εp⌈df(n,tk)/B⌉. The storage waste S(n, tk) for term tk is therefore

    S(n, tk) = c1(⌈df(n,tk)/B⌉·B − df(n,tk)) + εp⌈df(n,tk)/B⌉

The relative frequency estimate of the probability p(tk) that the kth term appears in a document is df(n,tk)/n; hence df(n,tk)=p(tk)n. The total storage waste over all terms after indexing n documents can therefore be rewritten as

    S(n) = Σ_{k=1}^{D(n)} [c1(⌈p(tk)n/B⌉·B − p(tk)n) + εp⌈p(tk)n/B⌉]

where D(n) is the number of unique terms after indexing n documents. Since it is hard to optimize the closed form of S(n), which contains discontinuous functions (e.g., the ceiling function), the upper and lower bounds of S(n), and the optimal node sizes they induce, are considered as follows. An upper bound W(S(n)) of S(n) is the storage overhead due to the extension pointers of all the chained nodes plus the latent free space when the last node is assumed to hold no postings. Hence,

    W(S(n)) = Σ_{k=1}^{D(n)} [εp(p(tk)n/B + 1) + c1·B]    (3)
A lower bound of S(n) counts only the total storage of the extension pointers, assuming that there is no latent free space in the last node. Therefore, the lower bound of S(n) is

    Σ_{k=1}^{D(n)} εp⌈p(tk)n/B⌉

The above two bounds differ only by the amount of latent free space and the storage of one extension pointer. As the number of indexed documents increases, the two bounds converge to S(n), since the latent free space becomes small compared with the storage overhead of the set of chained filled nodes. Thus, the optimal node sizes obtained from the upper and lower bounds are valid approximations of the optimal node size for S(n) in large collections.
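As a concrete illustration (with assumed byte costs), let c1 = 6 bytes per posting, εp = 4 bytes per pointer, B = 16 postings per node and df(n,tk) = 20. Then ⌈20/16⌉ = 2 nodes are allocated, and the exact waste is S(n,tk) = 6·(32 − 20) + 4·2 = 80 bytes, which indeed lies between the per-term lower bound 4·2 = 8 bytes and the per-term upper bound 4·(20/16 + 1) + 6·16 = 105 bytes.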

By disregarding the storage waste due to the latent free space (and the ceiling function), the lower bound of the storage waste can be approximated as

    S(n) ≥ (εp·n/B) Σ_{k=1}^{D(n)} p(tk) = εp·n·R(n)/B

where R(n) = Σ_{k=1}^{D(n)} p(tk) is called the total nominal (term) arrival rate after indexing n documents. This lower bound is not useful for finding the optimal node size, because it attains no optimum for finite values of B and because it discounts the latent free storage, encouraging the undesirable allocation of larger-than-necessary node sizes. Alternatively, optimizing the upper bound W(S(n)) of S(n) (as in Eq. (3)) limits the storage waste after indexing n documents while accounting for the latent free storage. Based on Eq. (3), the worst-case (upper bound) storage waste after indexing n documents is

    W(S(n)) = εp·n·R(n)/B + D(n)·(c1·B + εp).

Substituting W(S(n)) into Eq. (1), the expected worst-case storage waste E(W(S(N))) after indexing N documents is

    E(W(S(N))) = (1/N) Σ_{n=1}^{N} [εp·n·R(n)/B + D(n)·(c1·B + εp)].
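As a numeric sketch (with assumed byte costs c1 and εp), the fixed node size B that minimizes E(W(S(N))) can be found by direct search over candidate sizes, given the measured profiles R(n) and D(n):

```c
#define C1    6.0  /* bytes per posting (assumed) */
#define EPS_P 4.0  /* bytes per extension pointer (assumed) */

/* E(W(S(N))) of Eq. (1) with W(S(n)) taken from Eq. (3); R and D are
   1-indexed arrays holding the total nominal arrival rate and the
   vocabulary size measured after indexing n documents. */
double expected_worst_waste(int B, int N, const double *R, const double *D)
{
    double total = 0.0;
    for (int n = 1; n <= N; n++)
        total += EPS_P * n * R[n] / B + D[n] * (C1 * B + EPS_P);
    return total / N;
}

/* Search the candidate node sizes for the minimizer of E(W(S(N))). */
int optimal_node_size(int N, const double *R, const double *D)
{
    int best_B = 1;
    double best_w = expected_worst_waste(1, N, R, D);
    for (int B = 2; B <= 1024; B++) {
        double w = expected_worst_waste(B, N, R, D);
        if (w < best_w) {
            best_w = w;
            best_B = B;
        }
    }
    return best_B;
}
```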

The optimal node size is then obtained by minimizing this expected worst-case storage waste with respect to B.
