OpenBitSet和OpenBitSetIterator在TermRangeQuery中的运用
在MultiTermQuery 的rewrite方法中,如果 if (pendingTerms.size() >= termCountLimit || docVisitCount >= docCountCutoff) 的就会使用MultiTermQueryWrapperFilter,如果查询出来的term的总数目大于termCountLimit或者docVisitCount是 df ,如果df 大于docCountCutoff 则使用MultiTermQueryWrapperFilter,否则使用BooleanQuery,他们之间的关系是or的关系, MultiTermQueryWrapperFilter 使用OpenBitSet收集docId,使用OpenBitSetIterator还原docId
@Override
public Query rewrite(IndexReader reader, MultiTermQuery query) throws IOException {
// Get the enum and start visiting terms. If we
// exhaust the enum before hitting either of the
// cutoffs, we use ConstantBooleanQueryRewrite; else,
// ConstantFilterRewrite:
final Collection<Term> pendingTerms = new ArrayList<Term>();
final int docCountCutoff = (int) ((docCountPercent / 100.) * reader.maxDoc());
final int termCountLimit = Math.min(BooleanQuery.getMaxClauseCount(), termCountCutoff);
int docVisitCount = 0;
FilteredTermEnum enumerator = query.getEnum(reader);
try {
while(true) {
Term t = enumerator.term();
if (t != null) {
pendingTerms.add(t);
// Loading the TermInfo from the terms dict here
// should not be costly, because 1) the
// query/filter will load the TermInfo when it
// runs, and 2) the terms dict has a cache:
docVisitCount += reader.docFreq(t);
}
if (pendingTerms.size() >= termCountLimit || docVisitCount >= docCountCutoff) {
// Too many terms -- make a filter.
Query result = new ConstantScoreQuery(new MultiTermQueryWrapperFilter<MultiTermQuery>(query));
result.setBoost(query.getBoost());
return result;
} else if (!enumerator.next()) {
// Enumeration is done, and we hit a small
// enough number of terms & docs -- just make a
// BooleanQuery, now
BooleanQuery bq = new BooleanQuery(true);
for (final Term term: pendingTerms) {
TermQuery tq = new TermQuery(term);
bq.add(tq, BooleanClause.Occur.SHOULD);
}
// Strip scores
Query result = new ConstantScoreQuery(new QueryWrapperFilter(bq));
result.setBoost(query.getBoost());
query.incTotalNumberOfTerms(pendingTerms.size());
return result;
}
}
} finally {
enumerator.close();
}
}
收集的docId的代码 调用如图所示
先new 一个OpenBitSet,大小是查询出来的当前segemt中最大文档的数目,然后通过
SegmentTermDocs 的public int read(final int[] docs, final int[] freqs)
这个方法读取docId和frg,然后 通过for循环
for(int i=0;i<count;i++) {
bitSet.set(docs[i]);
}
把docId放到OpenBitSet里面
代码如下和注释如下
public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
//返回TermRangeTermEnum对象,这个对象先用用小的那个string new一个term,然后定位到tis文件,while循环读取term信息,然后去frg文件里面读取docId,在while循环里面,通过SegmentTermDocs读取frg文件的docId和frg。
final TermEnum enumerator = query.getEnum(reader);
try {
// if current term in enum is null, the enum is empty -> shortcut
if (enumerator.term() == null)
return DocIdSet.EMPTY_DOCIDSET;
// else fill into a OpenBitSet
final OpenBitSet bitSet = new OpenBitSet(reader.maxDoc());
final int[] docs = new int[32];
final int[] freqs = new int[32];
// new 一个SegmentTermDocs的实例,会调用它的read方法读取docId
TermDocs termDocs = reader.termDocs();
try {
int termCount = 0;
do {
Term term = enumerator.term();
if (term == null)
break;
termCount++;
SegmentTermDocs 在frg文件里面seek到term的对应的docid的开始位置
termDocs.seek(term);
while (true) {
// 读取docId,一次读取到32 个docId到 docs数组里面,如果没有32个则读取实际的数目
final int count = termDocs.read(docs, freqs);
if (count != 0) {
for(int i=0;i<count;i++) {
bitSet.set(docs[i]);
}
} else {
break;
}
}
} while (enumerator.next());
query.incTotalNumberOfTerms(termCount);
} finally {
termDocs.close();
}
return bitSet;
} finally {
enumerator.close();
}
}
enumerator.next()方法截图如下,enumerator是TermRangeTermEnum,会调用父类的FilteredTermEnum next方法。
FilteredTermEnum的next方法如下,他会调用actualEnum读取tis文件里面的下一个term,然后调用termCompare 方法,termCompare 这个方法是抽象方法,留给子类实现,
TermRangeTermEnum方法的实现逻辑是和右边的区间的term做一个比较,看查询的term是否超出区间
public boolean next() throws IOException {
if (actualEnum == null) return false; // the actual enumerator is not initialized!
currentTerm = null;
while (currentTerm == null) {
if (endEnum()) return false;
if (actualEnum.next()) {
Term term = actualEnum.term();
if (termCompare(term)) {
currentTerm = term;
return true;
}
}
else return false;
}
currentTerm = null;
return false;
}
还原是在ConstantScorer的nextDoc方法调用的如下图
public int nextDoc() throws IOException {
return docIdSetIterator.nextDoc();
}