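The inverted index is what is usually called a posting list. The key is a column value, and the entry stored under it is the set of IDs of the rows in which that value appears.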
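To make the idea concrete, here is a minimal sketch of an inverted index for a single column. It is not Pinot code: the class name `InvertedIndexSketch`, the `build` method, and the sample values are made up for illustration, and a plain `TreeMap` of lists stands in for the bitmaps Pinot actually uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration, not part of Pinot.
class InvertedIndexSketch {
  // Maps each distinct value to the ascending list of row ids that contain it.
  static Map<String, List<Integer>> build(String[] columnValues) {
    Map<String, List<Integer>> index = new TreeMap<>();
    for (int docId = 0; docId < columnValues.length; docId++) {
      index.computeIfAbsent(columnValues[docId], k -> new ArrayList<>()).add(docId);
    }
    return index;
  }

  public static void main(String[] args) {
    // Made-up column values; the array position is the row id.
    String[] teams = {"BOS", "NYA", "BOS", "CHA", "NYA", "BOS"};
    System.out.println(build(teams)); // prints {BOS=[0, 2, 5], CHA=[3], NYA=[1, 4]}
  }
}
```

A filter such as "team = BOS" can then be answered by reading one entry instead of scanning every row.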
Using Quickstart as the example again, here is how the inverted index is created.
The code path is the same one used for the dictionary index and the forward index.
// Initialize the index creation using the per-column statistics information
indexCreator.init(config, indexCreationInfoMap, dataSchema, totalDocs, tempIndexDir);
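Initialize the data structures. At this point the number of MutableRoaringBitmap instances each column needs is already known: one per unique value in the column.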
if (config.createInvertedIndexEnabled()) {
  invertedIndexCreatorMap.put(
      column,
      new BitmapInvertedIndexCreator(file, indexCreationInfo.getSortedUniqueElementsArray().length, schema
          .getFieldSpecFor(column)));
}
Reset the iterator.
// Build the index
recordReader.rewind();
Iterate over the records again and index each row.
LOGGER.info("Start building IndexCreator!");
while (recordReader.hasNext()) {
  long start = System.currentTimeMillis();
  GenericRow row = recordReader.next();
  long stop = System.currentTimeMillis();
  indexCreator.indexRow(row);
  long stop1 = System.currentTimeMillis();
  totalRecordReadTime += (stop - start);
  totalIndexTime += (stop1 - stop);
}
For each value, look up its position in the dictionary index, then record the current docId under that position in the inverted index.
@Override
public void indexRow(GenericRow row) {
  for (final String column : dictionaryCreatorMap.keySet()) {
    Object columnValueToIndex = row.getValue(column);
    Object dictionaryIndex;
    if (dictionaryCache.get(column).containsKey(columnValueToIndex)) {
      dictionaryIndex = dictionaryCache.get(column).get(columnValueToIndex);
    } else {
      dictionaryIndex = dictionaryCreatorMap.get(column).indexOf(columnValueToIndex);
      dictionaryCache.get(column).put(columnValueToIndex, dictionaryIndex);
    }
    forwardIndexCreatorMap.get(column).index(docIdCounter, dictionaryIndex);
    if (config.createInvertedIndexEnabled()) {
      invertedIndexCreatorMap.get(column).add(docIdCounter, dictionaryIndex);
    }
  }
  docIdCounter++;
}
Adding the docId to the inverted index:
private void indexSingleValue(int entry, int docId) {
  if (entry == -1) {
    return;
  }
  invertedIndex[entry].add(docId);
}
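The two steps above (dictionary lookup, then adding the docId to the bitmap for that dictionary entry) can be sketched in a self-contained form. This is a simplified illustration under stated assumptions, not the Pinot implementation: `IndexRowSketch`, `indexValue`, and `docsFor` are hypothetical names, a sorted `String[]` with binary search stands in for the dictionary, and `java.util.BitSet` stands in for `MutableRoaringBitmap`.

```java
import java.util.Arrays;
import java.util.BitSet;

// Hypothetical illustration, not part of Pinot.
class IndexRowSketch {
  private final String[] sortedUniqueValues; // stand-in for the dictionary
  private final BitSet[] invertedIndex;      // one bitmap per dictionary entry
  private int docIdCounter = 0;

  IndexRowSketch(String[] sortedUniqueValues) {
    this.sortedUniqueValues = sortedUniqueValues;
    this.invertedIndex = new BitSet[sortedUniqueValues.length];
    for (int i = 0; i < invertedIndex.length; i++) {
      invertedIndex[i] = new BitSet();
    }
  }

  // Indexes the value of the next row: value -> dictionary id -> bitmap.set(docId).
  void indexValue(String value) {
    int dictId = Arrays.binarySearch(sortedUniqueValues, value);
    if (dictId >= 0) { // values missing from the dictionary are skipped
      invertedIndex[dictId].set(docIdCounter);
    }
    docIdCounter++;
  }

  // Returns the bitmap of docIds for a value (empty if the value is unknown).
  BitSet docsFor(String value) {
    int dictId = Arrays.binarySearch(sortedUniqueValues, value);
    return dictId < 0 ? new BitSet() : invertedIndex[dictId];
  }
}
```

Because the number of unique values is known before indexing starts, the array of bitmaps can be allocated once up front, which is the same reason the creator shown earlier receives the length of the sorted unique elements array.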
@Override
public void seal() throws ConfigurationException, IOException {
  for (final String column : forwardIndexCreatorMap.keySet()) {
    forwardIndexCreatorMap.get(column).close();
    if (config.createInvertedIndexEnabled()) {
      invertedIndexCreatorMap.get(column).seal();
    }
    dictionaryCreatorMap.get(column).close();
  }
  writeMetadata();
}
@Override
public void seal() throws IOException {
  final DataOutputStream out =
      new DataOutputStream(new BufferedOutputStream(new FileOutputStream(invertedIndexFile)));
  // First, write out offsets of bitmaps. The information can be used to access a certain bitmap directly.
  // Totally (invertedIndex.length+1) offsets will be written out; the last offset is used to calculate the length of
  // the last bitmap, which might be needed when accessing bitmaps randomly.
  // If a bitmap's offset is k, then k bytes need to be skipped to reach the bitmap.
  int offset = 4 * (invertedIndex.length + 1); // The first bitmap's offset
  out.writeInt(offset);
  for (final MutableRoaringBitmap element : invertedIndex) { // the other bitmap's offset
    offset += element.serializedSizeInBytes();
    out.writeInt(offset);
  }
  // write out bitmaps one by one
  for (final MutableRoaringBitmap element : invertedIndex) {
    element.serialize(out);
  }
  out.close();
  LOGGER.debug("persisted bitmap inverted index for column : " + spec.getName() + " in "
      + invertedIndexFile.getAbsolutePath());
}
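The first 4 * (invertedIndex.length + 1) bytes of the file hold (invertedIndex.length + 1) integers. The first invertedIndex.length integers are the byte offsets of the bitmaps for the 1st through invertedIndex.length-th values. The last integer is the total length of the file, which makes it easy to compute the length of the last bitmap.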
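The layout is easier to see with a round trip. The sketch below writes the same kind of header (n + 1 big-endian int offsets followed by n payloads) and then reads payload i back using only two header entries. It is an illustration under assumptions: `OffsetLayoutSketch`, `write`, and `read` are hypothetical names, and raw byte arrays stand in for the serialized bitmaps.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical illustration, not part of Pinot.
class OffsetLayoutSketch {
  // Writes (n + 1) int offsets, then the n payloads back to back.
  static byte[] write(byte[][] payloads) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(bytes);
      int offset = 4 * (payloads.length + 1); // where the first payload starts
      out.writeInt(offset);
      for (byte[] payload : payloads) {
        offset += payload.length;             // start of the next payload
        out.writeInt(offset);
      }
      for (byte[] payload : payloads) {
        out.write(payload);
      }
      out.flush();
      return bytes.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Reads payload i: it spans [offset[i], offset[i + 1]).
  static byte[] read(byte[] file, int i) {
    ByteBuffer buf = ByteBuffer.wrap(file); // big-endian by default, like DataOutputStream
    int start = buf.getInt(4 * i);
    int end = buf.getInt(4 * (i + 1));
    return Arrays.copyOfRange(file, start, end);
  }
}
```

The extra (n + 1)-th offset is what lets `read` treat the last payload like any other, with no special case for the end of the file.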
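One last note: the inverted index is built on RoaringBitmap, a highly compressed bitmap that also performs very well, in some cases even faster than an uncompressed bitmap. It provides operations such as and, andNot, or, and xor. Druid uses this package too, and Lucene uses the same technique. For more on RoaringBitmap, see this post.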
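To show why these set operations matter for an inverted index, here is a small sketch of combining per-value bitmaps. It deliberately uses `java.util.BitSet` from the standard library as a stand-in, not RoaringBitmap itself; `BitmapOpsSketch` and its helper methods are hypothetical names introduced for illustration.

```java
import java.util.BitSet;

// Hypothetical illustration using java.util.BitSet, not RoaringBitmap.
class BitmapOpsSketch {
  // Builds a bitmap with the given docIds set.
  static BitSet of(int... docIds) {
    BitSet bitmap = new BitSet();
    for (int docId : docIds) {
      bitmap.set(docId);
    }
    return bitmap;
  }

  // Each helper copies its first argument so the stored bitmaps stay unchanged.
  static BitSet and(BitSet a, BitSet b) {
    BitSet result = (BitSet) a.clone();
    result.and(b);
    return result;
  }

  static BitSet or(BitSet a, BitSet b) {
    BitSet result = (BitSet) a.clone();
    result.or(b);
    return result;
  }

  static BitSet andNot(BitSet a, BitSet b) {
    BitSet result = (BitSet) a.clone();
    result.andNot(b);
    return result;
  }

  static BitSet xor(BitSet a, BitSet b) {
    BitSet result = (BitSet) a.clone();
    result.xor(b);
    return result;
  }
}
```

With one bitmap per value, a filter like "A AND B" becomes `and(bitmapA, bitmapB)`, "A OR B" becomes `or`, and "A AND NOT B" becomes `andNot`, so no rows need to be scanned to evaluate the predicate.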