Source Code Analysis of the Inverted Index in Pinot

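An inverted index is what is commonly called a posting list. The key is a column value, and the value is the set of ids of the rows that contain that column value.
Using Quickstart as the example again, this post walks through how the inverted index is created.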
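To make the definition concrete, here is a minimal sketch of the idea, independent of Pinot. The class name `InvertedIndexConcept` and the sample column values are invented for illustration; Pinot itself stores dictionary ids and bitmaps rather than strings and lists, as the rest of this post shows.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy illustration only: maps each distinct column value to the list of row ids containing it.
class InvertedIndexConcept {
    static Map<String, List<Integer>> build(String[] columnValues) {
        Map<String, List<Integer>> index = new TreeMap<>();
        for (int rowId = 0; rowId < columnValues.length; rowId++) {
            // Row ids are appended in increasing order, so each list stays sorted.
            index.computeIfAbsent(columnValues[rowId], k -> new ArrayList<>()).add(rowId);
        }
        return index;
    }

    public static void main(String[] args) {
        String[] teams = {"BOS", "NYA", "BOS", "CHA", "NYA", "BOS"}; // made-up sample column
        System.out.println(build(teams)); // prints {BOS=[0, 2, 5], CHA=[3], NYA=[1, 4]}
    }
}
```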

Collecting per-column statistics

The code is the same as for the dictionary index and the forward index.

Initializing the inverted index data structures

// Initialize the index creation using the per-column statistics information
indexCreator.init(config, indexCreationInfoMap, dataSchema, totalDocs, tempIndexDir);

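The data structures are initialized here. By this point the number of MutableRoaringBitmap instances each column needs is already known: one per unique value, taken from the length of the column's sorted unique elements array.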

if (config.createInvertedIndexEnabled()) {
    invertedIndexCreatorMap.put(
        column,
        new BitmapInvertedIndexCreator(file, indexCreationInfo.getSortedUniqueElementsArray().length, schema
            .getFieldSpecFor(column)));
}
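The relationship between the statistics pass and the initialization can be sketched as follows. This is not Pinot's code: `InvertedIndexInit` and its methods are hypothetical names, and `java.util.BitSet` stands in for MutableRoaringBitmap so the example runs with the JDK alone.

```java
import java.util.BitSet;
import java.util.TreeSet;

class InvertedIndexInit {
    // First pass: the sorted distinct values of one column (a stand-in for the per-column statistics).
    static String[] sortedUniqueValues(String[] columnValues) {
        TreeSet<String> unique = new TreeSet<>();
        for (String value : columnValues) {
            unique.add(value);
        }
        return unique.toArray(new String[0]);
    }

    // One empty bitmap per dictionary entry; BitSet is an assumption standing in for MutableRoaringBitmap.
    static BitSet[] allocateBitmaps(int cardinality) {
        BitSet[] bitmaps = new BitSet[cardinality];
        for (int i = 0; i < cardinality; i++) {
            bitmaps[i] = new BitSet();
        }
        return bitmaps;
    }

    public static void main(String[] args) {
        String[] column = {"BOS", "NYA", "BOS", "CHA"}; // made-up sample column
        String[] dictionary = sortedUniqueValues(column);
        BitSet[] invertedIndex = allocateBitmaps(dictionary.length);
        System.out.println(dictionary.length + " unique values, " + invertedIndex.length + " bitmaps");
    }
}
```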

Second pass: indexing each column row by row

Resetting the iterator

// Build the index
recordReader.rewind();

Iterating again and indexing each row

LOGGER.info("Start building IndexCreator!");
while (recordReader.hasNext()) {
  long start = System.currentTimeMillis();
  GenericRow row = recordReader.next();
  long stop = System.currentTimeMillis();
  indexCreator.indexRow(row);
  long stop1 = System.currentTimeMillis();
  totalRecordReadTime += (stop - start);
  totalIndexTime += (stop1 - stop);
}

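For each column, look up the value's index in the dictionary index (checking a per-column cache first), write it to the forward index, and, if inverted indexes are enabled, add it to the inverted index.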

@Override
public void indexRow(GenericRow row) {
    for (final String column : dictionaryCreatorMap.keySet()) {

        Object columnValueToIndex = row.getValue(column);
        Object dictionaryIndex;
        if (dictionaryCache.get(column).containsKey(columnValueToIndex)) {
            dictionaryIndex = dictionaryCache.get(column).get(columnValueToIndex);
        } else {
            dictionaryIndex = dictionaryCreatorMap.get(column).indexOf(columnValueToIndex);
            dictionaryCache.get(column).put(columnValueToIndex, dictionaryIndex);
        }
        forwardIndexCreatorMap.get(column).index(docIdCounter, dictionaryIndex);
        if (config.createInvertedIndexEnabled()) {
            invertedIndexCreatorMap.get(column).add(docIdCounter, dictionaryIndex);
        }
    }
    docIdCounter++;
}

Adding the doc id to the inverted index. Each dictionary entry has its own bitmap, and an entry of -1 is skipped.

private void indexSingleValue(int entry, int docId) {
    if (entry == -1) {
        return;
    }
    invertedIndex[entry].add(docId);
}
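Putting the lookup, the cache, and the bitmap update together for a single column gives roughly the following. This is a simplified sketch under stated assumptions: `RowIndexingSketch` is a hypothetical class, the dictionary lookup is a plain binary search over a sorted string array, and `BitSet` again stands in for MutableRoaringBitmap.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

class RowIndexingSketch {
    final String[] dictionary;    // sorted unique values; position = dictionary id
    final int[] forwardIndex;     // doc id -> dictionary id
    final BitSet[] invertedIndex; // dictionary id -> doc ids (BitSet stands in for MutableRoaringBitmap)
    final Map<String, Integer> cache = new HashMap<>();
    int docIdCounter = 0;

    RowIndexingSketch(String[] sortedDictionary, int totalDocs) {
        dictionary = sortedDictionary;
        forwardIndex = new int[totalDocs];
        invertedIndex = new BitSet[sortedDictionary.length];
        for (int i = 0; i < invertedIndex.length; i++) {
            invertedIndex[i] = new BitSet();
        }
    }

    void indexRow(String value) {
        // The binary search runs only on a cache miss; repeated values hit the map.
        int dictId = cache.computeIfAbsent(value, v -> Arrays.binarySearch(dictionary, v));
        if (dictId < 0) {
            throw new IllegalArgumentException("value not in dictionary: " + value);
        }
        forwardIndex[docIdCounter] = dictId;     // forward index: doc id -> dictionary id
        invertedIndex[dictId].set(docIdCounter); // inverted index: dictionary id -> doc ids
        docIdCounter++;
    }

    public static void main(String[] args) {
        String[] column = {"BOS", "NYA", "BOS", "CHA"}; // made-up sample column
        RowIndexingSketch sketch = new RowIndexingSketch(new String[]{"BOS", "CHA", "NYA"}, column.length);
        for (String value : column) {
            sketch.indexRow(value);
        }
        System.out.println(Arrays.toString(sketch.forwardIndex));  // prints [0, 2, 0, 1]
        System.out.println(Arrays.toString(sketch.invertedIndex)); // prints [{0, 2}, {3}, {1}]
    }
}
```

The forward index and the inverted index are filled in the same pass, which is why the second traversal of the records is enough to build both.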

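Saving: writing the output to file. The first seal() below is the segment-level one, which closes each column's creators and writes the metadata; the second is the inverted index creator's own seal(), which serializes the bitmaps.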

@Override
public void seal() throws ConfigurationException, IOException {
    for (final String column : forwardIndexCreatorMap.keySet()) {
        forwardIndexCreatorMap.get(column).close();
        if (config.createInvertedIndexEnabled()) {
            invertedIndexCreatorMap.get(column).seal();
        }
        dictionaryCreatorMap.get(column).close();
    }
    writeMetadata();
}

@Override
public void seal() throws IOException {
    final DataOutputStream out =
        new DataOutputStream(new BufferedOutputStream(new FileOutputStream(invertedIndexFile)));
    // First, write out offsets of bitmaps. The information can be used to access a certain bitmap directly.
    // Totally (invertedIndex.length+1) offsets will be written out; the last offset is used to calculate the length of
    // the last bitmap, which might be needed when accessing bitmaps randomly.
    // If a bitmap's offset is k, then k bytes need to be skipped to reach the bitmap.
    int offset = 4 * (invertedIndex.length + 1); // The first bitmap's offset
    out.writeInt(offset);
    for (final MutableRoaringBitmap element : invertedIndex) { // the other bitmap's offset
        offset += element.serializedSizeInBytes();
        out.writeInt(offset);
    }
    // write out bitmaps one by one
    for (final MutableRoaringBitmap element : invertedIndex) {
        element.serialize(out);
    }
    out.close();
    LOGGER.debug("persisted bitmap inverted index for column : " + spec.getName() + " in "
        + invertedIndexFile.getAbsolutePath());
}

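The first 4 * (invertedIndex.length + 1) bytes of the file hold (invertedIndex.length + 1) integers. The first invertedIndex.length integers store the offsets of the inverted index entries for the 1st through the invertedIndex.length-th value. The last integer stores the length of the whole file, which makes it quick to compute the length of the last entry.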
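The layout can be tried out with a small in-memory sketch. The assumptions here are significant: `OffsetLayoutSketch` is a hypothetical class, and each entry is serialized as a plain run of 4-byte doc ids instead of the RoaringBitmap format, so only the offset header matches what seal() writes.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.util.Arrays;

class OffsetLayoutSketch {
    // Writes (n + 1) int offsets followed by the n payloads, mirroring the header layout described above.
    static byte[] write(int[][] postingLists) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            int offset = 4 * (postingLists.length + 1); // the first payload starts right after the header
            out.writeInt(offset);
            for (int[] list : postingLists) {
                offset += 4 * list.length;              // stand-in for serializedSizeInBytes()
                out.writeInt(offset);
            }
            for (int[] list : postingLists) {
                for (int docId : list) {
                    out.writeInt(docId);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads the entry of one dictionary id using only two header integers.
    static int[] read(byte[] file, int dictId) {
        ByteBuffer buf = ByteBuffer.wrap(file);         // big-endian by default, like DataOutputStream
        int start = buf.getInt(4 * dictId);
        int end = buf.getInt(4 * (dictId + 1));         // for the last entry this is the file length
        int[] docIds = new int[(end - start) / 4];
        for (int i = 0; i < docIds.length; i++) {
            docIds[i] = buf.getInt(start + 4 * i);
        }
        return docIds;
    }

    public static void main(String[] args) {
        byte[] file = write(new int[][]{{0, 2, 5}, {3}, {1, 4}});
        System.out.println(file.length);                    // prints 40 (16-byte header + 24 payload bytes)
        System.out.println(Arrays.toString(read(file, 2))); // prints [1, 4]
    }
}
```

Because entry i spans the bytes from offset i to offset i + 1, a reader can jump straight to any entry without scanning the ones before it, and the extra final offset means the last entry needs no special case.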

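One last note: the inverted index uses RoaringBitmap, a highly compressed bitmap with very good performance. It can be even faster than an uncompressed bitmap, and it provides operations such as and, andNot, or, and xor. Druid also uses this package, and Lucene uses a similar technique. For more on RoaringBitmap, see this post.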
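To show why these set operations matter for an inverted index, the sketch below combines two per-value bitmaps. It uses the JDK's uncompressed `java.util.BitSet` as a stand-in, not the RoaringBitmap API; the class name `BitmapOpsSketch` and the sample doc ids are invented for illustration.

```java
import java.util.BitSet;

class BitmapOpsSketch {
    static BitSet of(int... docIds) {
        BitSet bits = new BitSet();
        for (int docId : docIds) {
            bits.set(docId);
        }
        return bits;
    }

    // Each helper copies its first argument so the stored index bitmaps are never modified.
    static BitSet and(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone();
        result.and(b);
        return result;
    }

    static BitSet or(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone();
        result.or(b);
        return result;
    }

    static BitSet andNot(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone();
        result.andNot(b);
        return result;
    }

    public static void main(String[] args) {
        BitSet teamIsBos = of(0, 2, 5);  // doc ids where a hypothetical team column equals "BOS"
        BitSet yearIs2013 = of(2, 3, 5); // doc ids where a hypothetical year column equals 2013
        System.out.println(and(teamIsBos, yearIs2013));    // prints {2, 5}
        System.out.println(or(teamIsBos, yearIs2013));     // prints {0, 2, 3, 5}
        System.out.println(andNot(teamIsBos, yearIs2013)); // prints {0}
    }
}
```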
