This chapter begins the walkthrough of lucene's query process, i.e. IndexSearcher's search function.
IndexSearcher::search
public TopDocs search(Query query, int n)
throws IOException {
return searchAfter(null, query, n);
}
public TopDocs searchAfter(ScoreDoc after, Query query, int numHits) throws IOException {
final int limit = Math.max(1, reader.maxDoc());
final int cappedNumHits = Math.min(numHits, limit);
final CollectorManager<TopScoreDocCollector, TopDocs> manager = new CollectorManager<TopScoreDocCollector, TopDocs>() {
@Override
public TopScoreDocCollector newCollector() throws IOException {
...
}
@Override
public TopDocs reduce(Collection<TopScoreDocCollector> collectors) throws IOException {
...
}
};
return search(query, manager);
}
The query parameter wraps the query itself, and n means the top n results are requested. The capping at the start of searchAfter guarantees that the final hit count n cannot exceed the total number of documents. A CollectorManager is then created, and the overloaded search is called to continue.
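As a quick orientation before diving deeper, here is a minimal usage sketch of this entry point (a sketch only, assuming a Lucene 6.x index already exists under the hypothetical directory "indexDir" with a hypothetical "body" field):
import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class SearchSketch {
    public static void main(String[] args) throws Exception {
        // open the hypothetical index directory and run the search(Query, int) shown above
        try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("indexDir")))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new TermQuery(new Term("body", "lucene")); // "body" is a hypothetical field
            TopDocs top = searcher.search(query, 10);
            for (ScoreDoc sd : top.scoreDocs) {
                System.out.println(sd.doc + " -> " + sd.score);
            }
        }
    }
}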
IndexSearcher::search->searchAfter->search
public <C extends Collector, T> T search(Query query, CollectorManager<C, T> collectorManager) throws IOException {
if (executor == null) {
final C collector = collectorManager.newCollector();
search(query, collector);
return collectorManager.reduce(Collections.singletonList(collector));
} else {
...
}
}
Assume a single-threaded query, so executor is null. First, the CollectorManager's newCollector creates a TopScoreDocCollector; each TopScoreDocCollector wraps the final query results. In the multi-threaded case, the multiple TopScoreDocCollectors have to be merged at the end.
IndexSearcher::search->searchAfter->search->CollectorManager::newCollector
public TopScoreDocCollector newCollector() throws IOException {
return TopScoreDocCollector.create(cappedNumHits, after);
}
public static TopScoreDocCollector create(int numHits, ScoreDoc after) {
if (after == null) {
return new SimpleTopScoreDocCollector(numHits);
} else {
return new PagingTopScoreDocCollector(numHits, after);
}
}
The after parameter implements pagination-like behavior; assume it is null here, so newCollector ends up returning a SimpleTopScoreDocCollector. Once the TopScoreDocCollector has been created, the overloaded search function is called to continue.
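Pagination with a non-null after, sketched under the same assumptions as the snippet earlier (searcher and query as before):
TopDocs page1 = searcher.search(query, 10);
ScoreDoc last = page1.scoreDocs[page1.scoreDocs.length - 1]; // bottom hit of page 1
// a non-null after makes create() choose PagingTopScoreDocCollector instead
TopDocs page2 = searcher.searchAfter(last, query, 10);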
IndexSearcher::search->searchAfter->search->search
public void search(Query query, Collector results)
throws IOException {
search(leafContexts, createNormalizedWeight(query, results.needsScores()), results);
}
leafContexts is the leaves member of the CompositeReaderContext, a list of LeafReaderContexts; each LeafReaderContext wraps one segment's SegmentReader, which can read all of that segment's information and data. Next, createNormalizedWeight performs the query matching and computes some basic weights for the scoring phase later on.
public Weight createNormalizedWeight(Query query, boolean needsScores) throws IOException {
query = rewrite(query);
Weight weight = createWeight(query, needsScores);
float v = weight.getValueForNormalization();
float norm = getSimilarity(needsScores).queryNorm(v);
if (Float.isInfinite(norm) || Float.isNaN(norm)) {
norm = 1.0f;
}
weight.normalize(norm, 1.0f);
return weight;
}
First, the Query is rewritten via the rewrite function, e.g. unnecessary clauses are removed and non-atomic queries are converted into atomic ones.
IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->rewrite
public Query rewrite(Query original) throws IOException {
Query query = original;
for (Query rewrittenQuery = query.rewrite(reader); rewrittenQuery != query;
rewrittenQuery = query.rewrite(reader)) {
query = rewrittenQuery;
}
return query;
}
rewrite is called here in a loop on the Query because a single rewrite pass may change the Query's structure and thereby expose further parts that can be rewritten. From here on, assume the query is a BooleanQuery. A BooleanQuery does not contain an actual query clause itself; it holds multiple sub-queries, each of which is either an indivisible query such as a TermQuery, or another BooleanQuery.
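For concreteness, a small hedged sketch of such a nested structure built with BooleanQuery.Builder (field names and terms hypothetical):
BooleanQuery inner = new BooleanQuery.Builder()
    .add(new TermQuery(new Term("body", "search")), BooleanClause.Occur.SHOULD)
    .add(new TermQuery(new Term("body", "query")), BooleanClause.Occur.SHOULD)
    .build();
BooleanQuery outer = new BooleanQuery.Builder()
    .add(new TermQuery(new Term("title", "lucene")), BooleanClause.Occur.MUST)
    .add(inner, BooleanClause.Occur.SHOULD) // a BooleanQuery nested as a sub-query
    .build();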
Because BooleanQuery's rewrite function is long, it is examined below in parts.
IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->BooleanQuery::rewrite
Part 1
public Query rewrite(IndexReader reader) throws IOException {
if (clauses.size() == 1) {
BooleanClause c = clauses.get(0);
Query query = c.getQuery();
if (minimumNumberShouldMatch == 1 && c.getOccur() == Occur.SHOULD) {
return query;
} else if (minimumNumberShouldMatch == 0) {
switch (c.getOccur()) {
case SHOULD:
case MUST:
return query;
case FILTER:
return new BoostQuery(new ConstantScoreQuery(query), 0);
case MUST_NOT:
return new MatchNoDocsQuery();
default:
throw new AssertionError();
}
}
}
...
}
If the BooleanQuery contains only one sub-query there is no need to wrap it; the Query inside that sub-query is returned directly.
The minimumNumberShouldMatch member states how many clauses must match at minimum. If the only clause is a SHOULD and a single match suffices, its Query is returned directly. With minimumNumberShouldMatch == 0, a MUST or SHOULD clause likewise returns the inner Query directly; a FILTER clause is wrapped in a ConstantScoreQuery inside a zero-boost BoostQuery; and a MUST_NOT clause means the query must match no documents at all, so a MatchNoDocsQuery is created.
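A hedged sketch of the single-clause collapse described above (reader is the open IndexReader from the first sketch; field and term hypothetical). With the Builder's default minimumNumberShouldMatch of 0, the wrapper disappears entirely:
Query single = new BooleanQuery.Builder()
    .add(new TermQuery(new Term("body", "lucene")), BooleanClause.Occur.SHOULD)
    .build();
Query rewritten = single.rewrite(reader); // returns the inner TermQuery itself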
IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->BooleanQuery::rewrite
Part 2
public Query rewrite(IndexReader reader) throws IOException {
...
{
BooleanQuery.Builder builder = new BooleanQuery.Builder();
builder.setDisableCoord(isCoordDisabled());
builder.setMinimumNumberShouldMatch(getMinimumNumberShouldMatch());
boolean actuallyRewritten = false;
for (BooleanClause clause : this) {
Query query = clause.getQuery();
Query rewritten = query.rewrite(reader);
if (rewritten != query) {
actuallyRewritten = true;
}
builder.add(rewritten, clause.getOccur());
}
if (actuallyRewritten) {
return builder.build();
}
}
...
}
This part of rewrite walks all of the BooleanQuery's sub-queries and calls rewrite on each recursively. If some call returns a Query different from the original, that sub-query was rewritten, and the BooleanQuery is rebuilt through BooleanQuery.Builder's build function.
IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->BooleanQuery::rewrite
Part 3
public Query rewrite(IndexReader reader) throws IOException {
...
{
int clauseCount = 0;
for (Collection<Query> queries : clauseSets.values()) {
clauseCount += queries.size();
}
if (clauseCount != clauses.size()) {
BooleanQuery.Builder rewritten = new BooleanQuery.Builder();
rewritten.setDisableCoord(disableCoord);
rewritten.setMinimumNumberShouldMatch(minimumNumberShouldMatch);
for (Map.Entry<Occur, Collection<Query>> entry : clauseSets.entrySet()) {
final Occur occur = entry.getKey();
for (Query query : entry.getValue()) {
rewritten.add(query, occur);
}
}
return rewritten.build();
}
}
...
}
clauseSets keeps the sub-query clauses whose occur is MUST_NOT or FILTER, stored in HashSets; the HashSet structure removes duplicate MUST_NOT and FILTER sub-queries.
IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->BooleanQuery::rewrite
Part 4
public Query rewrite(IndexReader reader) throws IOException {
...
if (clauseSets.get(Occur.MUST).size() > 0 && clauseSets.get(Occur.FILTER).size() > 0) {
final Set<Query> filters = new HashSet<>(clauseSets.get(Occur.FILTER));
boolean modified = filters.remove(new MatchAllDocsQuery());
modified |= filters.removeAll(clauseSets.get(Occur.MUST));
if (modified) {
BooleanQuery.Builder builder = new BooleanQuery.Builder();
builder.setDisableCoord(isCoordDisabled());
builder.setMinimumNumberShouldMatch(getMinimumNumberShouldMatch());
for (BooleanClause clause : clauses) {
if (clause.getOccur() != Occur.FILTER) {
builder.add(clause);
}
}
for (Query filter : filters) {
builder.add(filter, Occur.FILTER);
}
return builder.build();
}
}
...
}
This removes sub-queries that occur both as FILTER and as MUST, and also removes a match-all-documents FILTER sub-query (at this point there is certainly more than one clause), since the match-all result set contains the result set of every other query anyway.
IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->BooleanQuery::rewrite
Part 5
{
final Collection<Query> musts = clauseSets.get(Occur.MUST);
final Collection<Query> filters = clauseSets.get(Occur.FILTER);
if (musts.size() == 1
&& filters.size() > 0) {
Query must = musts.iterator().next();
float boost = 1f;
if (must instanceof BoostQuery) {
BoostQuery boostQuery = (BoostQuery) must;
must = boostQuery.getQuery();
boost = boostQuery.getBoost();
}
if (must.getClass() == MatchAllDocsQuery.class) {
BooleanQuery.Builder builder = new BooleanQuery.Builder();
for (BooleanClause clause : clauses) {
switch (clause.getOccur()) {
case FILTER:
case MUST_NOT:
builder.add(clause);
break;
default:
break;
}
}
Query rewritten = builder.build();
rewritten = new ConstantScoreQuery(rewritten);
builder = new BooleanQuery.Builder()
.setDisableCoord(isCoordDisabled())
.setMinimumNumberShouldMatch(getMinimumNumberShouldMatch())
.add(rewritten, Occur.MUST);
for (Query query : clauseSets.get(Occur.SHOULD)) {
builder.add(query, Occur.SHOULD);
}
rewritten = builder.build();
return rewritten;
}
}
}
return super.rewrite(reader);
If a MatchAllDocsQuery is the only MUST query, it is rewritten as shown above. Finally, if nothing was rewritten, the parent class Query's rewrite is called, which simply returns the query itself.
That concludes BooleanQuery's rewrite function; here is a brief look at the rewrite functions of the other Query types.
TermQuery's rewrite returns the query itself. SynonymQuery's rewrite checks whether it holds only one Query and, if so, converts it into a TermQuery. WildcardQuery, PrefixQuery, RegexpQuery and FuzzyQuery all inherit from MultiTermQuery. WildcardQuery's rewrite returns a MultiTermQueryConstantScoreWrapper wrapping the original Query; PrefixQuery's rewrite also returns a MultiTermQueryConstantScoreWrapper; RegexpQuery behaves like PrefixQuery; and FuzzyQuery, depending on the circumstances, ends up returning a BlendedTermQuery.
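A hedged sketch constructing the query types just mentioned (fields and terms hypothetical; the comments restate the rewrite behavior described above):
Query term     = new TermQuery(new Term("body", "lucene"));     // rewrite returns itself
Query wildcard = new WildcardQuery(new Term("body", "lu*ene")); // rewrites to a MultiTermQueryConstantScoreWrapper
Query prefix   = new PrefixQuery(new Term("body", "lucen"));    // likewise
Query regexp   = new RegexpQuery(new Term("body", "lu.*ne"));   // like PrefixQuery
Query fuzzy    = new FuzzyQuery(new Term("body", "lucene"));    // may end up as a BlendedTermQuery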
Back in createNormalizedWeight: with the Query rewritten, createWeight performs the matching and computes the weights.
IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->createWeight
public Weight createWeight(Query query, boolean needsScores) throws IOException {
final QueryCache queryCache = this.queryCache;
Weight weight = query.createWeight(this, needsScores);
if (needsScores == false && queryCache != null) {
weight = queryCache.doCache(weight, queryCachingPolicy);
}
return weight;
}
The queryCache member of IndexSearcher is initialized to an LRUQueryCache. createWeight dispatches to the createWeight function of the concrete Query; assume a BooleanQuery.
IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->createWeight->BooleanQuery::createWeight
public Weight createWeight(IndexSearcher searcher, boolean needsScores) throws IOException {
BooleanQuery query = this;
if (needsScores == false) {
query = rewriteNoScoring();
}
return new BooleanWeight(query, searcher, needsScores, disableCoord);
}
needsScores is true here because SimpleTopScoreDocCollector's needsScores returns true by default. createWeight creates and returns a BooleanWeight.
IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->createWeight->BooleanQuery::createWeight->BooleanWeight::BooleanWeight
BooleanWeight(BooleanQuery query, IndexSearcher searcher, boolean needsScores, boolean disableCoord) throws IOException {
super(query);
this.query = query;
this.needsScores = needsScores;
this.similarity = searcher.getSimilarity(needsScores);
weights = new ArrayList<>();
int i = 0;
int maxCoord = 0;
for (BooleanClause c : query) {
Weight w = searcher.createWeight(c.getQuery(), needsScores && c.isScoring());
weights.add(w);
if (c.isScoring()) {
maxCoord++;
}
i += 1;
}
this.maxCoord = maxCoord;
coords = new float[maxCoord+1];
Arrays.fill(coords, 1F);
coords[0] = 0f;
if (maxCoord > 0 && needsScores && disableCoord == false) {
boolean seenActualCoord = false;
for (i = 1; i < coords.length; i++) {
coords[i] = coord(i, maxCoord);
seenActualCoord |= (coords[i] != 1F);
}
this.disableCoord = seenActualCoord == false;
} else {
this.disableCoord = true;
}
}
getSimilarity returns the IndexSearcher's BM25Similarity by default. The BooleanWeight constructor calls createWeight recursively to obtain each sub-query's Weight; assume the sub-queries are TermQuerys, whose createWeight is examined next. maxCoord counts the scoring sub-queries, and the coords array at the end can influence a matched document's score: with ClassicSimilarity, coord(overlap, maxOverlap) = overlap / maxOverlap, i.e. the more query clauses a document matches, the higher the factor; BM25Similarity keeps the default coord of 1, in which case the loop above sees no actual coord values and sets disableCoord to true.
IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->createWeight->TermQuery::createWeight
public Weight createWeight(IndexSearcher searcher, boolean needsScores) throws IOException {
final IndexReaderContext context = searcher.getTopReaderContext();
final TermContext termState;
if (perReaderTermState == null
|| perReaderTermState.topReaderContext != context) {
termState = TermContext.build(context, term);
} else {
termState = this.perReaderTermState;
}
return new TermWeight(searcher, needsScores, termState);
}
getTopReaderContext returns the CompositeReaderContext, which wraps the SegmentReaders.
perReaderTermState defaults to null, so TermContext's build function performs the lookup and fetches the Term's entry from the term index; finally a TermWeight is created from the resulting TermContext and returned.
IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->createWeight->TermQuery::createWeight->TermContext::build
public static TermContext build(IndexReaderContext context, Term term)
throws IOException {
final String field = term.field();
final BytesRef bytes = term.bytes();
final TermContext perReaderTermState = new TermContext(context);
for (final LeafReaderContext ctx : context.leaves()) {
final Terms terms = ctx.reader().terms(field);
if (terms != null) {
final TermsEnum termsEnum = terms.iterator();
if (termsEnum.seekExact(bytes)) {
final TermState termState = termsEnum.termState();
perReaderTermState.register(termState, ctx.ord, termsEnum.docFreq(), termsEnum.totalTermFreq());
}
}
}
return perReaderTermState;
}
Term's bytes function returns the query bytes, UTF-8 encoded by default. LeafReaderContext's reader function returns the SegmentReader, and its terms function returns a FieldReader used to read the information in the index files.
IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->createWeight->TermQuery::createWeight->TermContext::build->SegmentReader::terms
public final Terms terms(String field) throws IOException {
return fields().terms(field);
}
public final Fields fields() {
return getPostingsReader();
}
public FieldsProducer getPostingsReader() {
ensureOpen();
return core.fields;
}
public Terms terms(String field) throws IOException {
FieldsProducer fieldsProducer = fields.get(field);
return fieldsProducer == null ? null : fieldsProducer.terms(field);
}
core is created as SegmentCoreReaders in the SegmentReader constructor, and its fields is a PerFieldPostingsFormat. fields.get ultimately returns the BlockTreeTermsReader that was set up when the index was written.
BlockTreeTermsReader's terms finally returns the FieldReader of the requested field.
Back in TermContext's build function, iterator returns a SegmentTermsEnum, and seekExact performs the lookup. On a match, SegmentTermsEnum's termState returns an IntBlockTermState wrapping the Term's details (seekExact itself is analyzed in the next chapter). build finally stores the resulting IntBlockTermState via TermContext's register function.
IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->createWeight->TermQuery::createWeight->TermContext::build->register
public void register(TermState state, final int ord, final int docFreq, final long totalTermFreq) {
register(state, ord);
accumulateStatistics(docFreq, totalTermFreq);
}
public void register(TermState state, final int ord) {
states[ord] = state;
}
public void accumulateStatistics(final int docFreq, final long totalTermFreq) {
this.docFreq += docFreq;
if (this.totalTermFreq >= 0 && totalTermFreq >= 0)
this.totalTermFreq += totalTermFreq;
else
this.totalTermFreq = -1;
}
The ord parameter identifies a unique IndexReaderContext, i.e. one segment. register stores the TermState (really an IntBlockTermState) into the states array, and accumulateStatistics then updates the aggregate statistics.
Back in TermQuery's createWeight, a TermWeight is finally created and returned.
IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->createWeight->TermQuery::createWeight->TermWeight::TermWeight
public TermWeight(IndexSearcher searcher, boolean needsScores, TermContext termStates)
throws IOException {
super(TermQuery.this);
this.needsScores = needsScores;
this.termStates = termStates;
this.similarity = searcher.getSimilarity(needsScores);
final CollectionStatistics collectionStats;
final TermStatistics termStats;
if (needsScores) {
collectionStats = searcher.collectionStatistics(term.field());
termStats = searcher.termStatistics(term, termStates);
} else {
...
}
this.stats = similarity.computeWeight(collectionStats, termStats);
}
At a high level, collectionStatistics collects statistics about a field, while termStatistics collects statistics about a single term.
These two are then passed to computeWeight to compute the weight. Each is examined below.
IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->createWeight->TermQuery::createWeight->TermWeight::TermWeight->IndexSearcher::collectionStatistics
public CollectionStatistics collectionStatistics(String field) throws IOException {
final int docCount;
final long sumTotalTermFreq;
final long sumDocFreq;
Terms terms = MultiFields.getTerms(reader, field);
if (terms == null) {
docCount = 0;
sumTotalTermFreq = 0;
sumDocFreq = 0;
} else {
docCount = terms.getDocCount();
sumTotalTermFreq = terms.getSumTotalTermFreq();
sumDocFreq = terms.getSumDocFreq();
}
return new CollectionStatistics(field, reader.maxDoc(), docCount, sumTotalTermFreq, sumDocFreq);
}
getTerms works as analyzed earlier and finally returns a FieldReader. From it are read docCount (the number of documents), sumTotalTermFreq (the sum of all termFreq values, i.e. the number of term occurrences per document, summed over the field) and sumDocFreq (the sum of all docFreq values, i.e. how many documents contain each term, summed over all terms); a CollectionStatistics wrapping these values is created and returned.
IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->createWeight->TermQuery::createWeight->TermWeight::TermWeight->IndexSearcher::termStatistics
public TermStatistics termStatistics(Term term, TermContext context) throws IOException {
return new TermStatistics(term.bytes(), context.docFreq(), context.totalTermFreq());
}
docFreq is the number of documents containing the term and totalTermFreq the total number of its occurrences across documents; a TermStatistics is created and returned, with a trivial constructor.
Back in TermWeight's constructor, similarity defaults to BM25Similarity, whose computeWeight follows.
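These statistics can also be inspected directly through the same calls the walkthrough just covered, grounded in the signatures shown above (a hedged sketch; searcher as before, field and term hypothetical):
Term t = new Term("body", "lucene");
TermContext termCtx = TermContext.build(searcher.getTopReaderContext(), t);
TermStatistics ts = searcher.termStatistics(t, termCtx);
CollectionStatistics cs = searcher.collectionStatistics("body");
System.out.println(ts.docFreq() + " docs contain the term; field has " + cs.sumTotalTermFreq() + " term occurrences");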
IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->createWeight->TermQuery::createWeight->TermWeight::TermWeight->BM25Similarity::computeWeight
public final SimWeight computeWeight(CollectionStatistics collectionStats, TermStatistics... termStats) {
Explanation idf = termStats.length == 1 ? idfExplain(collectionStats, termStats[0]) : idfExplain(collectionStats, termStats);
float avgdl = avgFieldLength(collectionStats);
float cache[] = new float[256];
for (int i = 0; i < cache.length; i++) {
cache[i] = k1 * ((1 - b) + b * decodeNormValue((byte)i) / avgdl);
}
return new BM25Stats(collectionStats.field(), idf, avgdl, cache);
}
idfExplain computes the idf, i.e. the inverse document frequency.
IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->createWeight->TermQuery::createWeight->TermWeight::TermWeight->BM25Similarity::computeWeight->idfExplain
public Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics termStats) {
final long df = termStats.docFreq();
final long docCount = collectionStats.docCount() == -1 ? collectionStats.maxDoc() : collectionStats.docCount();
final float idf = idf(df, docCount);
return Explanation.match(idf, "idf(docFreq=" + df + ", docCount=" + docCount + ")");
}
df is the number of documents containing the term and docCount the total number of documents. BM25Similarity's idf function computes log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)); the intuition is that the more documents a term occurs in, the less it distinguishes one document from another, and the lower its weight. For example, with docCount = 1000 and docFreq = 10, idf = log(1 + 990.5 / 10.5) ≈ 4.56.
Back in computeWeight, avgFieldLength computes the average number of terms per document.
IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->createWeight->TermQuery::createWeight->TermWeight::TermWeight->BM25Similarity::computeWeight->avgFieldLength
protected float avgFieldLength(CollectionStatistics collectionStats) {
final long sumTotalTermFreq = collectionStats.sumTotalTermFreq();
if (sumTotalTermFreq <= 0) {
return 1f;
} else {
final long docCount = collectionStats.docCount() == -1 ? collectionStats.maxDoc() : collectionStats.docCount();
return (float) (sumTotalTermFreq / (double) docCount);
}
}
avgFieldLength divides the total number of term occurrences by the document count, giving the average number of terms per document. Back in computeWeight, the cache array precomputes, for each of the 256 possible encoded document lengths, the BM25 length-normalization factor k1 * ((1 - b) + b * docLen / avgdl); BM25 is the ranking algorithm lucene uses here. Finally a BM25Stats is created and returned.
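To make the role of these quantities concrete, here is a small self-contained sketch of the BM25 term score that idf, avgdl and the cache feed into (plain arithmetic rather than lucene's exact code path; k1 = 1.2 and b = 0.75 are the default parameters):
public class Bm25Sketch {
    static final double K1 = 1.2, B = 0.75; // lucene's default BM25 parameters

    static double idf(long docFreq, long docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    static double score(double freq, double docLen, double avgdl, long docFreq, long docCount) {
        double norm = K1 * ((1 - B) + B * docLen / avgdl); // what cache[] precomputes per encoded length
        return idf(docFreq, docCount) * freq * (K1 + 1) / (freq + norm);
    }

    public static void main(String[] args) {
        // a term occurring twice in a 100-token document, average length 120, present in 10 of 1000 docs
        System.out.println(score(2, 100, 120, 10, 1000));
    }
}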
Back in createNormalizedWeight, getValueForNormalization is called next to compute the normalization value.
IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->BooleanWeight::getValueForNormalization
public float getValueForNormalization() throws IOException {
float sum = 0.0f;
int i = 0;
for (BooleanClause clause : query) {
float s = weights.get(i).getValueForNormalization();
if (clause.isScoring()) {
sum += s;
}
i += 1;
}
return sum ;
}
BooleanWeight's getValueForNormalization accumulates the values returned by the sub-Weights' getValueForNormalization functions. Assuming a TermQuery sub-query, the corresponding Weight is a TermWeight, whose getValueForNormalization follows.
IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->BooleanWeight::getValueForNormalization->TermWeight::getValueForNormalization
public float getValueForNormalization() {
return stats.getValueForNormalization();
}
public float getValueForNormalization() {
return weight * weight;
}
public void normalize(float queryNorm, float boost) {
this.boost = boost;
this.weight = idf.getValue() * boost;
}
stats is the BM25Stats built earlier; its getValueForNormalization ultimately returns the square of the idf value multiplied by boost.
Back in createNormalizedWeight, queryNorm simply returns 1, and normalize then recomputes the weights from that norm. First, BooleanWeight's normalize function:
IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->BooleanWeight::normalize
public void normalize(float norm, float boost) {
for (Weight w : weights) {
w.normalize(norm, boost);
}
}
Assume the sub-query's Weight is a TermWeight.
IndexSearcher::search->searchAfter->search->search->createNormalizedWeight->TermWeight::normalize
public void normalize(float queryNorm, float boost) {
stats.normalize(queryNorm, boost);
}
public void normalize(float queryNorm, float boost) {
this.boost = boost;
this.weight = idf.getValue() * boost;
}
Back in IndexSearcher's search function: after createNormalizedWeight returns the Weight, the overloaded search function defined below is called.
IndexSearcher::search->searchAfter->search->search->search
protected void search(List<LeafReaderContext> leaves, Weight weight, Collector collector)
throws IOException {
for (LeafReaderContext ctx : leaves) {
final LeafCollector leafCollector;
try {
leafCollector = collector.getLeafCollector(ctx);
} catch (CollectionTerminatedException e) {
// no doc of interest in this reader context, continue with the next leaf
continue;
}
BulkScorer scorer = weight.bulkScorer(ctx);
if (scorer != null) {
try {
scorer.score(leafCollector, ctx.reader().getLiveDocs());
} catch (CollectionTerminatedException e) {
}
}
}
}
As covered in part 6 of this series (lucene源码分析—6), leaves is the list of LeafReaderContexts wrapping the SegmentReaders, and collector is the SimpleTopScoreDocCollector.
IndexSearcher::search->searchAfter->search->search->search->SimpleTopScoreDocCollector::getLeafCollector
public LeafCollector getLeafCollector(LeafReaderContext context)
throws IOException {
final int docBase = context.docBase;
return new ScorerLeafCollector() {
@Override
public void collect(int doc) throws IOException {
float score = scorer.score();
totalHits++;
if (score <= pqTop.score) {
return;
}
pqTop.doc = doc + docBase;
pqTop.score = score;
pqTop = pq.updateTop();
}
};
}
getLeafCollector creates and returns a ScorerLeafCollector.
Back in search, Weight's bulkScorer function is called next to obtain a BulkScorer, which drives the scoring.
Assume the Weight created by createNormalizedWeight is a BooleanWeight; its bulkScorer function follows.
IndexSearcher::search->searchAfter->search->search->search->BooleanWeight::bulkScorer
public BulkScorer bulkScorer(LeafReaderContext context) throws IOException {
final BulkScorer bulkScorer = booleanScorer(context);
if (bulkScorer != null) {
return bulkScorer;
} else {
return super.bulkScorer(context);
}
}
bulkScorer first calls booleanScorer to try to build a BooleanScorer; assume that returns null, so the parent class Weight's bulkScorer is called and its result returned.
IndexSearcher::search->searchAfter->search->search->search->Weight::bulkScorer
public BulkScorer bulkScorer(LeafReaderContext context) throws IOException {
Scorer scorer = scorer(context);
if (scorer == null) {
return null;
}
return new DefaultBulkScorer(scorer);
}
The scorer function is overridden in BooleanWeight:
IndexSearcher::search->searchAfter->search->search->search->Weight::bulkScorer->BooleanWeight::scorer
public Scorer scorer(LeafReaderContext context) throws IOException {
int minShouldMatch = query.getMinimumNumberShouldMatch();
List<Scorer> required = new ArrayList<>();
List<Scorer> requiredScoring = new ArrayList<>();
List<Scorer> prohibited = new ArrayList<>();
List<Scorer> optional = new ArrayList<>();
Iterator<BooleanClause> cIter = query.iterator();
for (Weight w : weights) {
BooleanClause c = cIter.next();
Scorer subScorer = w.scorer(context);
if (subScorer == null) {
if (c.isRequired()) {
return null;
}
} else if (c.isRequired()) {
required.add(subScorer);
if (c.isScoring()) {
requiredScoring.add(subScorer);
}
} else if (c.isProhibited()) {
prohibited.add(subScorer);
} else {
optional.add(subScorer);
}
}
if (optional.size() == minShouldMatch) {
required.addAll(optional);
requiredScoring.addAll(optional);
optional.clear();
minShouldMatch = 0;
}
if (required.isEmpty() && optional.isEmpty()) {
return null;
} else if (optional.size() < minShouldMatch) {
return null;
}
if (!needsScores && minShouldMatch == 0 && required.size() > 0) {
optional.clear();
}
if (optional.isEmpty()) {
return excl(req(required, requiredScoring, disableCoord), prohibited);
}
if (required.isEmpty()) {
return excl(opt(optional, minShouldMatch, disableCoord), prohibited);
}
Scorer req = excl(req(required, requiredScoring, true), prohibited);
Scorer opt = opt(optional, minShouldMatch, true);
if (disableCoord) {
if (minShouldMatch > 0) {
return new ConjunctionScorer(this, Arrays.asList(req, opt), Arrays.asList(req, opt), 1F);
} else {
return new ReqOptSumScorer(req, opt);
}
} else if (optional.size() == 1) {
if (minShouldMatch > 0) {
return new ConjunctionScorer(this, Arrays.asList(req, opt), Arrays.asList(req, opt), coord(requiredScoring.size()+1, maxCoord));
} else {
float coordReq = coord(requiredScoring.size(), maxCoord);
float coordBoth = coord(requiredScoring.size() + 1, maxCoord);
return new BooleanTopLevelScorers.ReqSingleOptScorer(req, opt, coordReq, coordBoth);
}
} else {
if (minShouldMatch > 0) {
return new BooleanTopLevelScorers.CoordinatingConjunctionScorer(this, coords, req, requiredScoring.size(), opt);
} else {
return new BooleanTopLevelScorers.ReqMultiOptScorer(req, opt, requiredScoring.size(), coords);
}
}
}
BooleanWeight's scorer function loops over the sub-queries' Weights and calls scorer on each; assume a TermWeight.
IndexSearcher::search->searchAfter->search->search->search->Weight::bulkScorer->BooleanWeight::scorer->TermWeight::scorer
public Scorer scorer(LeafReaderContext context) throws IOException {
final TermsEnum termsEnum = getTermsEnum(context);
PostingsEnum docs = termsEnum.postings(null, needsScores ? PostingsEnum.FREQS : PostingsEnum.NONE);
return new TermScorer(this, docs, similarity.simScorer(stats, context));
}
IndexSearcher::search->searchAfter->search->search->search->Weight::bulkScorer->BooleanWeight::scorer->TermWeight::scorer->getTermsEnum
private TermsEnum getTermsEnum(LeafReaderContext context) throws IOException {
final TermState state = termStates.get(context.ord);
final TermsEnum termsEnum = context.reader().terms(term.field())
.iterator();
termsEnum.seekExact(term.bytes(), state);
return termsEnum;
}
First the TermState produced by the earlier lookup is fetched; iterator returns a SegmentTermsEnum. SegmentTermsEnum's seekExact here mainly installs that previously computed TermState; the details are left for the next chapter.
Back in TermWeight's scorer, SegmentTermsEnum's postings function is called next, which ultimately lands in Lucene50PostingsReader's postings function.
IndexSearcher::search->searchAfter->search->search->search->Weight::bulkScorer->BooleanWeight::scorer->TermWeight::scorer->Lucene50PostingsReader::postings
public PostingsEnum postings(FieldInfo fieldInfo, BlockTermState termState, PostingsEnum reuse, int flags) throws IOException {
boolean indexHasPositions = fieldInfo.getIndexOptions().compareTo(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS) >= 0;
boolean indexHasOffsets = fieldInfo.getIndexOptions().compareTo(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS) >= 0;
boolean indexHasPayloads = fieldInfo.hasPayloads();
if (indexHasPositions == false || PostingsEnum.featureRequested(flags, PostingsEnum.POSITIONS) == false) {
BlockDocsEnum docsEnum;
if (reuse instanceof BlockDocsEnum) {
...
} else {
docsEnum = new BlockDocsEnum(fieldInfo);
}
return docsEnum.reset((IntBlockTermState) termState, flags);
} else if ((indexHasOffsets == false || PostingsEnum.featureRequested(flags, PostingsEnum.OFFSETS) == false) &&
(indexHasPayloads == false || PostingsEnum.featureRequested(flags, PostingsEnum.PAYLOADS) == false)) {
...
} else {
...
}
}
First the index options stored in the index are read. Assume the first if branch is taken; the reuse parameter is null by default, so a BlockDocsEnum is created and initialized via its reset function.
IndexSearcher::search->searchAfter->search->search->search->Weight::bulkScorer->BooleanWeight::scorer->TermWeight::scorer->BM25Similarity::simScorer
public final SimScorer simScorer(SimWeight stats, LeafReaderContext context) throws IOException {
BM25Stats bm25stats = (BM25Stats) stats;
return new BM25DocScorer(bm25stats, context.reader().getNormValues(bm25stats.field));
}
public final NumericDocValues getNormValues(String field) throws IOException {
ensureOpen();
Map<String, NumericDocValues> normFields = normsLocal.get();
NumericDocValues norms = normFields.get(field);
if (norms != null) {
return norms;
} else {
FieldInfo fi = getFieldInfos().fieldInfo(field);
if (fi == null || !fi.hasNorms()) {
return null;
}
norms = getNormsReader().getNorms(fi);
normFields.put(field, norms);
return norms;
}
}
getNormsReader returns the Lucene53NormsProducer, whose getNorms creates a NumericDocValues from the field info and the data read from the .nvd and .nvm files. simScorer finally creates and returns a BM25DocScorer.
Back in TermWeight's scorer, a TermScorer is created at the end and returned.
Back up in BooleanWeight's scorer: if the number of SHOULD scorers equals minShouldMatch, then even though the clauses are only SHOULD they must all be satisfied, so they are moved into the MUST group. Further down, if both the MUST and SHOULD scorer lists are empty there are no query conditions at all, and null is returned; if there are fewer SHOULD scorers than minShouldMatch, the SHOULD clauses cannot possibly match enough, and null is returned as well. Next, if optional is empty there are no SHOULD scorers, so req wraps the MUST scorers and excl strips out the MUST_NOT scorers; conversely, if required is empty there are no MUST scorers, and opt wraps the SHOULD scorers. As a concrete illustration (clause letters hypothetical), a query like +a +b -c d with minShouldMatch = 0 yields required = [a, b], prohibited = [c] and optional = [d]; the required side becomes ReqExclScorer(ConjunctionScorer(a, b), c), which is then combined with d, e.g. in a ReqOptSumScorer when coord is disabled.
IndexSearcher::search->searchAfter->search->search->search->Weight::bulkScorer->BooleanWeight::scorer->req
private Scorer req(List<Scorer> required, List<Scorer> requiredScoring, boolean disableCoord) {
if (required.size() == 1) {
Scorer req = required.get(0);
if (needsScores == false) {
return req;
}
if (requiredScoring.isEmpty()) {
return new FilterScorer(req) {
@Override
public float score() throws IOException {
return 0f;
}
@Override
public int freq() throws IOException {
return 0;
}
};
}
float boost = 1f;
if (disableCoord == false) {
boost = coord(1, maxCoord);
}
if (boost == 1f) {
return req;
}
return new BooleanTopLevelScorers.BoostedScorer(req, boost);
} else {
return new ConjunctionScorer(this, required, requiredScoring,
disableCoord ? 1.0F : coord(requiredScoring.size(), maxCoord));
}
}
If there is more than one MUST scorer, a ConjunctionScorer is created directly. With exactly one: if requiredScoring is empty, the single MUST scorer needs no scoring and it is wrapped in a FilterScorer whose score is 0; otherwise, if a coord boost other than 1 applies, a BoostedScorer is returned; in the remaining cases the scorer itself is returned unchanged.
IndexSearcher::search->searchAfter->search->search->search->Weight::bulkScorer->BooleanWeight::scorer->excl
private Scorer excl(Scorer main, List<Scorer> prohibited) throws IOException {
if (prohibited.isEmpty()) {
return main;
} else if (prohibited.size() == 1) {
return new ReqExclScorer(main, prohibited.get(0));
} else {
float coords[] = new float[prohibited.size()+1];
Arrays.fill(coords, 1F);
return new ReqExclScorer(main, new DisjunctionSumScorer(this, prohibited, coords, false));
}
}
Depending on whether there are MUST_NOT scorers, excl either wraps the Scorer into a ReqExclScorer (meaning the documents matched by the prohibited scorers are excluded) or returns it unchanged.
IndexSearcher::search->searchAfter->search->search->search->Weight::bulkScorer->BooleanWeight::scorer->opt
private Scorer opt(List<Scorer> optional, int minShouldMatch, boolean disableCoord) throws IOException {
if (optional.size() == 1) {
Scorer opt = optional.get(0);
if (!disableCoord && maxCoord > 1) {
return new BooleanTopLevelScorers.BoostedScorer(opt, coord(1, maxCoord));
} else {
return opt;
}
} else {
float coords[];
if (disableCoord) {
coords = new float[optional.size()+1];
Arrays.fill(coords, 1F);
} else {
coords = this.coords;
}
if (minShouldMatch > 1) {
return new MinShouldMatchSumScorer(this, optional, minShouldMatch, coords);
} else {
return new DisjunctionSumScorer(this, optional, coords, needsScores);
}
}
}
Similar to req, the opt function wraps the scorers into a BoostedScorer, a MinShouldMatchSumScorer or a DisjunctionSumScorer depending on the situation, or returns the single scorer directly.
Back in BooleanWeight's scorer, depending on disableCoord, minShouldMatch and optional, the scorers are wrapped further into a ConjunctionScorer, ReqOptSumScorer, ReqSingleOptScorer, CoordinatingConjunctionScorer or ReqMultiOptScorer.
Back in Weight's bulkScorer and BooleanWeight's bulkScorer, the result of scorer is wrapped into a DefaultBulkScorer and returned.
Going back up to IndexSearcher's search function, the freshly created DefaultBulkScorer's score function is called next.
IndexSearcher::search->searchAfter->search->search->search->DefaultBulkScorer::score
public void score(LeafCollector collector, Bits acceptDocs) throws IOException {
final int next = score(collector, acceptDocs, 0, DocIdSetIterator.NO_MORE_DOCS);
}
public int score(LeafCollector collector, Bits acceptDocs, int min, int max) throws IOException {
collector.setScorer(scorer);
if (scorer.docID() == -1 && min == 0 && max == DocIdSetIterator.NO_MORE_DOCS) {
scoreAll(collector, iterator, twoPhase, acceptDocs);
return DocIdSetIterator.NO_MORE_DOCS;
} else {
...
}
}
static void scoreAll(LeafCollector collector, DocIdSetIterator iterator, TwoPhaseIterator twoPhase, Bits acceptDocs) throws IOException {
if (twoPhase == null) {
for (int doc = iterator.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = iterator.nextDoc()) {
if (acceptDocs == null || acceptDocs.get(doc)) {
collector.collect(doc);
}
}
} else {
...
}
}
By default every matching document must be scored, so follow scoreAll: it iterates over iterator and calls ScorerLeafCollector's collect function on each document to score it.
IndexSearcher::search->searchAfter->search->search->search->DefaultBulkScorer::score->score->scoreAll->ScorerLeafCollector::collect
public void collect(int doc) throws IOException {
float score = scorer.score();
totalHits++;
if (score <= pqTop.score) {
return;
}
pqTop.doc = doc + docBase;
pqTop.score = score;
pqTop = pq.updateTop();
}
Assume the scorer member is a ReqOptSumScorer; its score function computes the document's score. Since this post does not cover lucene's scoring algorithms, that path is not followed further. The freshly computed score is written into pqTop, the top of the priority queue, and updateTop re-heapifies the queue and returns the new top.
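The queue pq here is lucene's HitQueue, a min-heap pre-filled with sentinel entries so that pqTop always holds the weakest of the current top n hits. A minimal sketch of the same top-n idea using java.util.PriorityQueue rather than lucene's actual HitQueue:
import java.util.PriorityQueue;

public class TopNSketch {
    public static void main(String[] args) {
        int n = 3;
        float[] scores = {0.3f, 1.2f, 0.7f, 2.5f, 0.1f, 1.9f};
        PriorityQueue<Float> pq = new PriorityQueue<>(n); // min-heap: head is the weakest kept hit
        for (float score : scores) {
            if (pq.size() < n) {
                pq.add(score);
            } else if (score > pq.peek()) { // mirrors "if (score <= pqTop.score) return;"
                pq.poll();
                pq.add(score);
            }
        }
        System.out.println(pq); // the three highest scores, weakest at the head
    }
}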
Back in the enclosing search function, the last step is CollectorManager's reduce function.
IndexSearcher::search->searchAfter->search->CollectorManager::reduce
public TopDocs reduce(Collection<TopScoreDocCollector> collectors) throws IOException {
final TopDocs[] topDocs = new TopDocs[collectors.size()];
int i = 0;
for (TopScoreDocCollector collector : collectors) {
topDocs[i++] = collector.topDocs();
}
return TopDocs.merge(cappedNumHits, topDocs);
}
This iterates over collectors, takes each SimpleTopScoreDocCollector and calls its topDocs function to extract the top documents.
IndexSearcher::search->searchAfter->search->CollectorManager::reduce->SimpleTopScoreDocCollector::topDocs
public TopDocs topDocs() {
return topDocs(0, topDocsSize());
}
public TopDocs topDocs(int start, int howMany) {
int size = topDocsSize();
if (start < 0 || start >= size || howMany <= 0) {
return newTopDocs(null, start);
}
howMany = Math.min(size - start, howMany);
ScoreDoc[] results = new ScoreDoc[howMany];
for (int i = pq.size() - start - howMany; i > 0; i--) { pq.pop(); }
populateResults(results, howMany);
return newTopDocs(results, start);
}
topDocsSize returns the number of hits in the queue; pop discards the ScoreDocs that are not needed; populateResults copies the remaining ScoreDocs into results; and newTopDocs finally creates a TopDocs wrapping the ScoreDoc array.
Back in CollectorManager's reduce, merge combines the TopDocs produced by the different TopScoreDocCollectors.
IndexSearcher::search->searchAfter->search->CollectorManager::reduce->TopDocs::merge
public static TopDocs merge(int topN, TopDocs[] shardHits) throws IOException {
return merge(0, topN, shardHits);
}
public static TopDocs merge(int start, int topN, TopDocs[] shardHits) throws IOException {
return mergeAux(null, start, topN, shardHits);
}
private static TopDocs mergeAux(Sort sort, int start, int size, TopDocs[] shardHits) throws IOException {
final PriorityQueue<ShardRef> queue;
if (sort == null) {
queue = new ScoreMergeSortQueue(shardHits);
} else {
queue = new MergeSortQueue(sort, shardHits);
}
int totalHitCount = 0;
int availHitCount = 0;
float maxScore = Float.MIN_VALUE;
for (int shardIDX = 0; shardIDX < shardHits.length; shardIDX++) {
final TopDocs shard = shardHits[shardIDX];
totalHitCount += shard.totalHits;
if (shard.scoreDocs != null && shard.scoreDocs.length > 0) {
availHitCount += shard.scoreDocs.length;
queue.add(new ShardRef(shardIDX));
maxScore = Math.max(maxScore, shard.getMaxScore());
}
}
if (availHitCount == 0) {
maxScore = Float.NaN;
}
final ScoreDoc[] hits;
if (availHitCount <= start) {
hits = new ScoreDoc[0];
} else {
hits = new ScoreDoc[Math.min(size, availHitCount - start)];
int requestedResultWindow = start + size;
int numIterOnHits = Math.min(availHitCount, requestedResultWindow);
int hitUpto = 0;
while (hitUpto < numIterOnHits) {
ShardRef ref = queue.top();
final ScoreDoc hit = shardHits[ref.shardIndex].scoreDocs[ref.hitIndex++];
hit.shardIndex = ref.shardIndex;
if (hitUpto >= start) {
hits[hitUpto - start] = hit;
}
hitUpto++;
if (ref.hitIndex < shardHits[ref.shardIndex].scoreDocs.length) {
queue.updateTop();
} else {
queue.pop();
}
}
}
if (sort == null) {
return new TopDocs(totalHitCount, hits, maxScore);
} else {
return new TopFieldDocs(totalHitCount, hits, sort.getSort(), maxScore);
}
}
In short, mergeAux merges several TopDocs into one TopDocs and returns it.
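As a usage sketch of the merge entry point shown above (td1 and td2 stand for hypothetical per-collector or per-shard TopDocs):
TopDocs merged = TopDocs.merge(10, new TopDocs[] { td1, td2 });
System.out.println(merged.totalHits + " hits, max score " + merged.getMaxScore());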
That rounds off the overall flow of IndexSearcher's search function; the next chapter starts analyzing the functions that actually read the index files.