2022-03-30

Understanding how Elasticsearch highlighting works


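1. Some highlighting problems

Elasticsearch provides three highlighters, and we have already taken a brief look at how highlighting works. Highlighting is closely tied to the type of query being run. The key point is the rewriting of multi-term queries such as wildcard and prefix: both the query itself and the highlighter need to rewrite the query, and if the two rewrite it differently, you may run into the following situation.

For the same query, the unified and fvh highlighters return different highlight results, and the fvh highlighter may return no highlight information at all.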

2. Data environment

Elasticsearch 8.0

PUT highlight_test
{
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "term_vector": "with_positions_offsets"
      }
    }
  },
  "settings": {
    "number_of_replicas": 0,
    "number_of_shards": 1
  }
}

PUT highlight_test/_doc/1
{
  "name": "mango",
  "text": "my name is mongo, i am test hightlight in elastic search"
}

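3. A brief introduction to multi-term query rewriting

A multi-term query is a query that does not name explicit terms. Elasticsearch has to rewrite it to work out which concrete terms it refers to. The following queries involve multi-term query rewriting: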

fuzzy
prefix
query_string
regexp
wildcard

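All of the queries above support the rewrite parameter, and each is ultimately rewritten to either a bool query or a bit set.

Query rewriting mainly affects the following:

which terms are collected during the rewrite, and how many;

how relevance is calculated for the collected terms.

Query rewriting supports the following options: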

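constant_score (default): if only a few terms need to be collected, the query is rewritten to a bool query; otherwise all matching terms are collected and the query is rewritten to a bit set. The boost parameter is used directly as the document score; for term-level queries boost generally defaults to 1.

constant_score_boolean: rewrites the query to a bool query and uses the boost parameter as the document score. It is subject to the indices.query.bool.max_clause_count limit, so at most 1024 terms are collected by default.

scoring_boolean: rewrites the query to a bool query and calculates a relevance score for each document. It is subject to the indices.query.bool.max_clause_count limit, so at most 1024 terms are collected by default.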

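top_terms_blended_freqs_N: collects the N highest-scoring terms and rewrites the query to a bool query. This option is not subject to the indices.query.bool.max_clause_count limit. Each document is scored as if all matching terms had the same frequency, namely the maximum frequency among the matching terms.

top_terms_boost_N: collects the N highest-scoring terms and rewrites the query to a bool query. This option is not subject to the indices.query.bool.max_clause_count limit. The boost is used directly as the document score.

top_terms_N: collects the N highest-scoring terms and rewrites the query to a bool query. This option is not subject to the indices.query.bool.max_clause_count limit. A relevance score is calculated for each matching document.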

4. Wildcard query rewrite analysis

Let us walk through the Elasticsearch source to see how the following query is rewritten.

{
  "query": {
    "wildcard": {
      "text": {
        "value": "m*"
      }
    }
  }
}

A WildcardQuery is built from the mapped type of the queried field, using the MultiTermQuery.RewriteMethod that corresponds to the rewrite value configured in the query.

// WildcardQueryBuilder.java
@Override
protected Query doToQuery(SearchExecutionContext context) throws IOException {
    MappedFieldType fieldType = context.getFieldType(fieldName);
    if (fieldType == null) {
        throw new IllegalStateException("Rewrite first");
    }
    MultiTermQuery.RewriteMethod method = QueryParsers.parseRewriteMethod(rewrite, null, LoggingDeprecationHandler.INSTANCE);
    return fieldType.wildcardQuery(value, method, caseInsensitive, context);
}

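The matching MultiTermQuery.RewriteMethod is looked up from the rewrite value configured in the query. Because our wildcard query does not set the rewrite parameter, the method simply returns the default passed in, which is null.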

// QueryParsers.java
public static MultiTermQuery.RewriteMethod parseRewriteMethod(
    @Nullable String rewriteMethod,
    @Nullable MultiTermQuery.RewriteMethod defaultRewriteMethod,
    DeprecationHandler deprecationHandler
) {
    if (rewriteMethod == null) {
        return defaultRewriteMethod;
    }
    if (CONSTANT_SCORE.match(rewriteMethod, deprecationHandler)) {
        return MultiTermQuery.CONSTANT_SCORE_REWRITE;
    }
    if (SCORING_BOOLEAN.match(rewriteMethod, deprecationHandler)) {
        return MultiTermQuery.SCORING_BOOLEAN_REWRITE;
    }
    if (CONSTANT_SCORE_BOOLEAN.match(rewriteMethod, deprecationHandler)) {
        return MultiTermQuery.CONSTANT_SCORE_BOOLEAN_REWRITE;
    }
    int firstDigit = -1;
    for (int i = 0; i < rewriteMethod.length(); ++i) {
        if (Character.isDigit(rewriteMethod.charAt(i))) {
            firstDigit = i;
            break;
        }
    }
    if (firstDigit >= 0) {
        final int size = Integer.parseInt(rewriteMethod.substring(firstDigit));
        String rewriteMethodName = rewriteMethod.substring(0, firstDigit);
        if (TOP_TERMS.match(rewriteMethodName, deprecationHandler)) {
            return new MultiTermQuery.TopTermsScoringBooleanQueryRewrite(size);
        }
        if (TOP_TERMS_BOOST.match(rewriteMethodName, deprecationHandler)) {
            return new MultiTermQuery.TopTermsBoostOnlyBooleanQueryRewrite(size);
        }
        if (TOP_TERMS_BLENDED_FREQS.match(rewriteMethodName, deprecationHandler)) {
            return new MultiTermQuery.TopTermsBlendedFreqScoringRewrite(size);
        }
    }
    throw new IllegalArgumentException("Failed to parse rewrite_method [" + rewriteMethod + "]");
}

WildcardQuery extends MultiTermQuery, so its rewrite method is called directly. Because our wildcard query does not set the rewrite parameter, the default CONSTANT_SCORE_REWRITE is used.

// MultiTermQuery.java
protected RewriteMethod rewriteMethod = CONSTANT_SCORE_REWRITE;

@Override
public final Query rewrite(IndexReader reader) throws IOException {
    return rewriteMethod.rewrite(reader, this);
}

As you can see, CONSTANT_SCORE_REWRITE is an anonymous class whose rewrite method returns an instance of MultiTermQueryConstantScoreWrapper.

// MultiTermQuery.java
public static final RewriteMethod CONSTANT_SCORE_REWRITE =
    new RewriteMethod() {
        @Override
        public Query rewrite(IndexReader reader, MultiTermQuery query) {
            return new MultiTermQueryConstantScoreWrapper<>(query);
        }
    };

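The method below first obtains the full set of terms for the queried field.

It then calls query.getTermsEnum to obtain all the terms that match the query.

Finally, the return value of collectTerms decides whether a bool query or a bit set is built.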

// MultiTermQueryConstantScoreWrapper.java
private WeightOrDocIdSet rewrite(LeafReaderContext context) throws IOException {
    final Terms terms = context.reader().terms(query.field);
    if (terms == null) {
        // field does not exist
        return new WeightOrDocIdSet((DocIdSet) null);
    }
    final TermsEnum termsEnum = query.getTermsEnum(terms);
    assert termsEnum != null;
    PostingsEnum docs = null;
    final List<TermAndState> collectedTerms = new ArrayList<>();
    if (collectTerms(context, termsEnum, collectedTerms)) {
        // build a boolean query
        BooleanQuery.Builder bq = new BooleanQuery.Builder();
        for (TermAndState t : collectedTerms) {
            final TermStates termStates = new TermStates(searcher.getTopReaderContext());
            termStates.register(t.state, context.ord, t.docFreq, t.totalTermFreq);
            bq.add(new TermQuery(new Term(query.field, t.term), termStates), Occur.SHOULD);
        }
        Query q = new ConstantScoreQuery(bq.build());
        final Weight weight = searcher.rewrite(q).createWeight(searcher, scoreMode, score());
        return new WeightOrDocIdSet(weight);
    }
    // Too many terms: go back to the terms we already collected and start building the bit set
    DocIdSetBuilder builder = new DocIdSetBuilder(context.reader().maxDoc(), terms);
    if (collectedTerms.isEmpty() == false) {
        TermsEnum termsEnum2 = terms.iterator();
        for (TermAndState t : collectedTerms) {
            termsEnum2.seekExact(t.term, t.state);
            docs = termsEnum2.postings(docs, PostingsEnum.NONE);
            builder.add(docs);
        }
    }
    // Then keep filling the bit set with remaining terms
    do {
        docs = termsEnum.postings(docs, PostingsEnum.NONE);
        builder.add(docs);
    } while (termsEnum.next() != null);
    return new WeightOrDocIdSet(builder.build());
}

By default, collectTerms collects at most 16 of the terms matched by the query.

// MultiTermQueryConstantScoreWrapper.java
private static final int BOOLEAN_REWRITE_TERM_COUNT_THRESHOLD = 16;

private boolean collectTerms(
    LeafReaderContext context, TermsEnum termsEnum, List<TermAndState> terms) throws IOException {
    final int threshold = Math.min(BOOLEAN_REWRITE_TERM_COUNT_THRESHOLD, IndexSearcher.getMaxClauseCount());
    for (int i = 0; i < threshold; ++i) {
        final BytesRef term = termsEnum.next();
        if (term == null) {
            return true;
        }
        TermState state = termsEnum.termState();
        terms.add(
            new TermAndState(
                BytesRef.deepCopyOf(term),
                state,
                termsEnum.docFreq(),
                termsEnum.totalTermFreq()));
    }
    return termsEnum.next() == null;
}

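From the analysis above, a wildcard query by default covers every term in the field that matches the query: up to 16 matching terms are rewritten to a bool query, and beyond that all matching terms are written into a bit set.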
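The decision described above can be modelled without Lucene. The following is a minimal sketch, not the real implementation: the class name, the `rewrite` helper, and its string return values are hypothetical, a sorted `TreeSet` stands in for the field's terms dictionary, and a prefix test stands in for the wildcard `m*`. Only the threshold of 16 comes from the source quoted above.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.TreeSet;

// Simplified, Lucene-free model of the default constant_score rewrite decision.
// All names are hypothetical; only the threshold value 16 comes from the quoted source.
class ConstantScoreRewriteSketch {
    static final int BOOLEAN_REWRITE_TERM_COUNT_THRESHOLD = 16;

    // Returns "bool(n)" when all matching terms fit under the threshold,
    // otherwise "bitset(n)", where n counts every matching term.
    static String rewrite(TreeSet<String> fieldTerms, String prefix) {
        // Stand-in for query.getTermsEnum(terms): matching terms in sorted order.
        Iterator<String> matching =
                fieldTerms.stream().filter(t -> t.startsWith(prefix)).iterator();

        List<String> collected = new ArrayList<>();
        for (int i = 0; i < BOOLEAN_REWRITE_TERM_COUNT_THRESHOLD; i++) {
            if (!matching.hasNext()) {
                return "bool(" + collected.size() + ")"; // enumeration exhausted early
            }
            collected.add(matching.next());
        }
        if (!matching.hasNext()) {
            return "bool(" + collected.size() + ")"; // exactly at the threshold
        }
        // Too many terms: every remaining matching term also goes into the bit set.
        int total = collected.size();
        while (matching.hasNext()) {
            matching.next();
            total++;
        }
        return "bitset(" + total + ")";
    }

    public static void main(String[] args) {
        TreeSet<String> few = new TreeSet<>(List.of("mango", "mongo", "my", "name"));
        System.out.println(rewrite(few, "m")); // bool(3)

        TreeSet<String> many = new TreeSet<>();
        for (int i = 0; i < 100; i++) {
            many.add(String.format("m%03d", i));
        }
        System.out.println(rewrite(many, "m")); // bitset(100)
    }
}
```

Either way no matching term is dropped at query time: the threshold only changes the shape of the rewritten query, not which documents match.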

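5. Wildcard query rewriting in the fvh highlighter

For a multi-term query, extracting the query terms is a key step of the highlighting logic.

We use the following highlight request to analyse how the query is rewritten while the highlighter extracts the query terms.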

{
  "query": {
    "wildcard": {
      "text": {
        "value": "m*"
      }
    }
  },
  "highlight": {
    "fields": {
      "text": {
        "type": "fvh"
      }
    }
  }
}

By default, only fields that match the query are highlighted, so a CustomFieldQuery is built here.

// FastVectorHighlighter.java
if (field.fieldOptions().requireFieldMatch()) {
    /*
     * we use top level reader to rewrite the query against all readers,
     * with use caching it across hits (and across readers...)
     */
    entry.fieldMatchFieldQuery = new CustomFieldQuery(
        fieldContext.query,
        hitContext.topLevelReader(),
        true,
        field.fieldOptions().requireFieldMatch()
    );
}

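The constructor calls flatten to obtain the rewritten flatQueries, then adds each flattened query to its phrase map, unwrapping any BoostQuery along the way to accumulate its boost.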

// FieldQuery.java
public FieldQuery(Query query, IndexReader reader, boolean phraseHighlight, boolean fieldMatch) throws IOException {
    this.fieldMatch = fieldMatch;
    Set<Query> flatQueries = new LinkedHashSet<>();
    flatten(query, reader, flatQueries, 1f);
    saveTerms(flatQueries, reader);
    Collection<Query> expandQueries = expand(flatQueries);
    for (Query flatQuery : expandQueries) {
        QueryPhraseMap rootMap = getRootMap(flatQuery);
        rootMap.add(flatQuery, reader);
        float boost = 1f;
        while (flatQuery instanceof BoostQuery) {
            BoostQuery bq = (BoostQuery) flatQuery;
            flatQuery = bq.getQuery();
            boost *= bq.getBoost();
        }
        if (!phraseHighlight && flatQuery instanceof PhraseQuery) {
            PhraseQuery pq = (PhraseQuery) flatQuery;
            if (pq.getTerms().length > 1) {
                for (Term term : pq.getTerms()) rootMap.addTerm(term, boost);
            }
        }
    }
}

Because WildcardQuery is a subclass of MultiTermQuery, flatten ends up rewriting it directly with MultiTermQuery.TopTermsScoringBooleanQueryRewrite, where the top N is MAX_MTQ_TERMS = 1024.

// FieldQuery.java
private static final int MAX_MTQ_TERMS = 1024;

protected void flatten(
    Query sourceQuery, IndexReader reader, Collection<Query> flatQueries, float boost) throws IOException {
    ..................................
    ..................................
    else if (reader != null) {
        Query query = sourceQuery;
        Query rewritten;
        if (sourceQuery instanceof MultiTermQuery) {
            rewritten =
                new MultiTermQuery.TopTermsScoringBooleanQueryRewrite(MAX_MTQ_TERMS)
                    .rewrite(reader, (MultiTermQuery) query);
        } else {
            rewritten = query.rewrite(reader);
        }
        if (rewritten != query) {
            // only rewrite once and then flatten again - the rewritten query could have a speacial
            // treatment
            // if this method is overwritten in a subclass.
            flatten(rewritten, reader, flatQueries, boost);
        }
        // if the query is already rewritten we discard it
    }
    // else discard queries
}

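The number of matching terms to collect is the smaller of the configured size and getMaxSize() (IndexSearcher.getMaxClauseCount(), 1024 by default); here the result is 1024.

The implementation of the anonymous TermCollector subclass passed to collectTerms is elided here; it is what determines how many terms are finally collected.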

// TopTermsRewrite.java
@Override
public final Query rewrite(final IndexReader reader, final MultiTermQuery query) throws IOException {
    final int maxSize = Math.min(size, getMaxSize());
    final PriorityQueue<ScoreTerm> stQueue = new PriorityQueue<>();
    collectTerms(
        reader,
        query,
        new TermCollector() {
            ................
        });
    .............
    return build(b);
}

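This method first obtains the full set of terms for the queried field, then obtains all the terms that match the query, and finally passes each of them to the supplied collector.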

// TermCollectingRewrite.java
final void collectTerms(IndexReader reader, MultiTermQuery query, TermCollector collector) throws IOException {
    IndexReaderContext topReaderContext = reader.getContext();
    for (LeafReaderContext context : topReaderContext.leaves()) {
        final Terms terms = context.reader().terms(query.field);
        if (terms == null) {
            // field does not exist
            continue;
        }
        final TermsEnum termsEnum = getTermsEnum(query, terms, collector.attributes);
        assert termsEnum != null;
        if (termsEnum == TermsEnum.EMPTY) continue;
        collector.setReaderContext(topReaderContext, context);
        collector.setNextEnum(termsEnum);
        BytesRef bytes;
        while ((bytes = termsEnum.next()) != null) {
            if (!collector.collect(bytes))
                return; // interrupt whole term collection, so also don't iterate other subReaders
        }
    }
}

The collector makes sure that the number of collected matching terms never exceeds maxSize.

// TopTermsRewrite.java
@Override
public boolean collect(BytesRef bytes) throws IOException {
    final float boost = boostAtt.getBoost();
    // make sure within a single seg we always collect
    // terms in order
    assert compareToLastTerm(bytes);
    // System.out.println("TTR.collect term=" + bytes.utf8ToString() + " boost=" + boost + "
    // ord=" + readerContext.ord);
    // ignore uncompetitive hits
    if (stQueue.size() == maxSize) {
        final ScoreTerm t = stQueue.peek();
        if (boost < t.boost) return true;
        if (boost == t.boost && bytes.compareTo(t.bytes.get()) > 0) return true;
    }
    ScoreTerm t = visitedTerms.get(bytes);
    final TermState state = termsEnum.termState();
    assert state != null;
    if (t != null) {
        // if the term is already in the PQ, only update docFreq of term in PQ
        assert t.boost == boost : "boost should be equal in all segment TermsEnums";
        t.termState.register(
            state, readerContext.ord, termsEnum.docFreq(), termsEnum.totalTermFreq());
    } else {
        // add new entry in PQ, we must clone the term, else it may get overwritten!
        st.bytes.copyBytes(bytes);
        st.boost = boost;
        visitedTerms.put(st.bytes.get(), st);
        assert st.termState.docFreq() == 0;
        st.termState.register(
            state, readerContext.ord, termsEnum.docFreq(), termsEnum.totalTermFreq());
        stQueue.offer(st);
        // possibly drop entries from queue
        if (stQueue.size() > maxSize) {
            st = stQueue.poll();
            visitedTerms.remove(st.bytes.get());
            st.termState.clear(); // reset the termstate!
        } else {
            st = new ScoreTerm(new TermStates(topReaderContext));
        }
        assert stQueue.size() <= maxSize : "the PQ size must be limited to maxSize";
        // set maxBoostAtt with values to help FuzzyTermsEnum to optimize
        if (stQueue.size() == maxSize) {
            t = stQueue.peek();
            maxBoostAtt.setMaxNonCompetitiveBoost(t.boost);
            maxBoostAtt.setCompetitiveTerm(t.bytes.get());
        }
    }
    return true;
}

The analysis above shows that the fvh highlighter rewrites multi-term queries directly with MultiTermQuery.TopTermsScoringBooleanQueryRewrite and collects at most 1024 query terms.
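To make the bounded collection easier to follow, here is a minimal Lucene-free sketch of the same idea: a priority queue whose head is the weakest entry, trimmed back to maxSize after every insertion. The class, the `ScoredTerm` type, and `collectTop` are hypothetical names, plain `String` comparison stands in for the byte-order comparison of `BytesRef`, and per-segment bookkeeping (`visitedTerms`, term states) is left out.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical, simplified model of top-N term collection with a bounded priority queue.
class TopTermsCollectSketch {
    static final class ScoredTerm {
        final String term;
        final float boost;

        ScoredTerm(String term, float boost) {
            this.term = term;
            this.boost = boost;
        }
    }

    // Weakest entry first: lowest boost, and for equal boosts the term that sorts last.
    static final Comparator<ScoredTerm> WEAKEST_FIRST = (a, b) -> {
        int byBoost = Float.compare(a.boost, b.boost);
        return byBoost != 0 ? byBoost : b.term.compareTo(a.term);
    };

    // Keeps at most maxSize terms and returns them in sorted order.
    static List<String> collectTop(List<ScoredTerm> matching, int maxSize) {
        PriorityQueue<ScoredTerm> queue = new PriorityQueue<>(WEAKEST_FIRST);
        for (ScoredTerm candidate : matching) {
            queue.offer(candidate);
            if (queue.size() > maxSize) {
                queue.poll(); // drop the weakest entry so the queue stays bounded
            }
        }
        List<String> kept = new ArrayList<>();
        for (ScoredTerm s : queue) {
            kept.add(s.term);
        }
        kept.sort(Comparator.naturalOrder());
        return kept;
    }

    public static void main(String[] args) {
        List<ScoredTerm> sameBoost = List.of(
                new ScoredTerm("ma", 1f), new ScoredTerm("mb", 1f),
                new ScoredTerm("mc", 1f), new ScoredTerm("md", 1f));
        System.out.println(collectTop(sameBoost, 2)); // [ma, mb]
    }
}
```

When every term has the same boost, as in this sketch's wildcard-like input, the terms that survive are simply the ones that sort first, which is why the limit behaves like "the first N matching terms".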

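6. Why rewriting can cause highlighting problems

From the analysis of the query rewrite and the highlight rewrite above, we know that by default:

The query phase covers all terms that match the query, and this behaviour can be customised through the rewrite parameter.

The highlight phase collects only the first 1024 terms that match the query, and this behaviour is not controlled by the rewrite parameter.

If the queried field is a large text field with many terms, the terms through which a document matched may not be among those first 1024, so the document clearly matches but no highlight information is returned.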
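The mismatch can be reproduced in a small Lucene-free model. This is only a sketch under stated assumptions: the class and method names are hypothetical, a prefix test stands in for the wildcard `m*`, all terms are assumed to have equal boost so that the highlighter's limit keeps the first 1024 matching terms in sorted order, and the 2000 synthetic terms `m0000` to `m1999` are made up for illustration.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

// Hypothetical model: the query phase sees every matching term,
// the highlight phase sees only the first MAX_MTQ_TERMS of them.
class MissingHighlightSketch {
    static final int MAX_MTQ_TERMS = 1024;

    // Query phase: a document matches if any of its terms matches the pattern.
    static boolean queryMatches(Set<String> docTerms, String prefix) {
        return docTerms.stream().anyMatch(t -> t.startsWith(prefix));
    }

    // Highlight phase: only the first MAX_MTQ_TERMS matching terms are extracted.
    static Set<String> highlightTerms(TreeSet<String> fieldTerms, String prefix) {
        return fieldTerms.stream()
                .filter(t -> t.startsWith(prefix))
                .limit(MAX_MTQ_TERMS)
                .collect(Collectors.toSet());
    }

    // A document gets a highlight only if it contains one of the extracted terms.
    static boolean hasHighlight(Set<String> docTerms, Set<String> extracted) {
        return docTerms.stream().anyMatch(extracted::contains);
    }

    public static void main(String[] args) {
        TreeSet<String> fieldTerms = new TreeSet<>();
        for (int i = 0; i < 2000; i++) {
            fieldTerms.add(String.format("m%04d", i));
        }
        Set<String> extracted = highlightTerms(fieldTerms, "m");

        Set<String> lateDoc = Set.of("m1500");
        System.out.println(queryMatches(lateDoc, "m"));       // true
        System.out.println(hasHighlight(lateDoc, extracted)); // false

        Set<String> earlyDoc = Set.of("m0005");
        System.out.println(hasHighlight(earlyDoc, extracted)); // true
    }
}
```

The document containing only `m1500` is a hit for the query but falls outside the extracted terms, which is the "matched but not highlighted" symptom described above.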

7. Solutions

Make the query more specific so that it matches fewer terms, for example by entering more characters.

Replace the multi-term query with another type of query.
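As a rough illustration of the first suggestion, the sketch below counts how many terms a prefix expands to and finds the shortest prefix of the user's input that stays within the highlighter's limit. It is an assumption-laden model, not an Elasticsearch API: the class and method names are hypothetical, the field's terms are given as an in-memory collection, and a prefix test again stands in for the wildcard.

```java
import java.util.Collection;
import java.util.Optional;
import java.util.TreeSet;

// Hypothetical helper: how specific must the input be to stay under the 1024-term limit?
class ExpansionCountSketch {
    static final int MAX_MTQ_TERMS = 1024;

    // Number of field terms the pattern "<prefix>*" would expand to.
    static long countExpansions(Collection<String> fieldTerms, String prefix) {
        return fieldTerms.stream().filter(t -> t.startsWith(prefix)).count();
    }

    // Shortest prefix of the input whose expansion fits within the limit, if any.
    static Optional<String> shortestSafePrefix(Collection<String> fieldTerms, String input) {
        for (int len = 1; len <= input.length(); len++) {
            String prefix = input.substring(0, len);
            if (countExpansions(fieldTerms, prefix) <= MAX_MTQ_TERMS) {
                return Optional.of(prefix);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        TreeSet<String> fieldTerms = new TreeSet<>();
        for (int i = 0; i < 2000; i++) {
            fieldTerms.add(String.format("m%04d", i));
        }
        System.out.println(countExpansions(fieldTerms, "m"));        // 2000
        System.out.println(shortestSafePrefix(fieldTerms, "m1500")); // Optional[m1]
    }
}
```

In this synthetic data set one extra character is enough: `m*` expands to 2000 terms, while `m1*` expands to 1000 and fits under the limit.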
