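Disjunction Max (DisMax): the maximum over a disjunction (union)
In essence this is a joint search across multiple fields, each with its own weight. When a document matches, the score of the best-scoring field is taken as the result score. This gives a completely different result from simply summing the per-field boosts. It is quite complicated to use: you need to look at the debugQuery output and experiment repeatedly!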
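To make the difference between "take the maximum" and "sum the boosts" concrete, here is a minimal sketch. The per-field scores are invented for illustration, and the `tie` handling assumes the usual DisjunctionMaxQuery behaviour (best score plus `tie` times the other scores); it is not Solr's actual scoring code.

```python
def boolean_sum_score(field_scores):
    """Score as a plain sum of the per-field scores (BooleanQuery-style)."""
    return sum(field_scores.values())


def dismax_score(field_scores, tie=0.0):
    """Best field score plus `tie` times the remaining field scores."""
    if not field_scores:
        return 0.0
    best = max(field_scores.values())
    others = sum(field_scores.values()) - best
    return best + tie * others


# Hypothetical per-field scores for one document, boosts already applied.
scores = {"name": 3.0, "features": 2.0}
print(boolean_sum_score(scores))      # 5.0
print(dismax_score(scores))           # 3.0
print(dismax_score(scores, tie=1.0))  # 5.0
```

With `tie=0.0` only the best field counts; with `tie=1.0` the result is the same as the plain sum.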
http://wiki.apache.org/solr/DisMax
http://searchhub.org/dev/2010/05/23/whats-a-dismax/
What's a "DisMax"? Posted by hossman
The term "dismax" gets tossed around on the Solr lists frequently, which can be fairly confusing to new users. It originated as a shorthand name for the DisMaxRequestHandler (which I named after the DisjunctionMaxQueryParser, which I named after the DisjunctionMaxQuery class that it uses heavily). In recent years, the DisMaxRequestHandler and the StandardRequestHandler were both refactored into a single SearchHandler class, and now the term "dismax" usually refers to the DisMaxQParser.
Note: "dismax" now corresponds to the DisMaxQParser; the DisMaxRequestHandler and the StandardRequestHandler have been refactored into SearchHandler.
Clear as Mudd, right?
Regardless of whether you use the DisMaxRequestHandler via the qt=dismax parameter, or use the SearchHandler with the DisMaxQParser via defType=dismax, the end result is that your q parameter gets parsed by the DisjunctionMaxQueryParser.
Note: qt=dismax selects the DisMaxRequestHandler, while defType=dismax uses the DisMaxQParser inside SearchHandler. In both cases the q parameter is parsed by the DisjunctionMaxQueryParser.
The original goals of dismax (whichever meaning you might infer) have never changed:
…supports a simplified version of the Lucene QueryParser syntax. Quotes can be used to group phrases, and +/- can be used to denote mandatory and optional clauses… but all other Lucene query parser special characters are escaped to simplify the user experience. The handler takes responsibility for building a good query from the user's input using BooleanQueries containing DisjunctionMaxQueries across fields and boosts you specify. It also allows you to provide additional boosting queries, boosting functions, and filtering queries to artificially affect the outcome of all searches. These options can all be specified as default parameters for the handler in your solrconfig.xml or overridden in the Solr query URL.
In short: You worry about what fields and boosts you want to use when you configure it, your users just give you words w/o worrying too much about syntax.
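Note: the dismax handler is mainly responsible for wrapping DisjunctionMaxQueries in boolean queries. It also lets you add boost queries, boost functions, and filter queries by hand to influence the final search results. All parameters can be configured in solrconfig.xml as global defaults, or added as URL parameters and applied dynamically to a single query or a class of queries.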
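As an illustration of passing these options in the query URL, the sketch below assembles a request with the parameter names that appear in this article (defType, q, qf, mm, debugQuery). The host, port, and core name are placeholders, not something the article specifies.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical Solr endpoint; host, port, and core name are placeholders.
SOLR_SELECT = "http://localhost:8983/solr/mycore/select"


def build_dismax_url(user_query, qf, mm, debug=True):
    """Build a select URL that asks Solr to parse `user_query` with dismax."""
    params = {
        "defType": "dismax",  # use the DisMaxQParser for the q parameter
        "q": user_query,      # raw user input
        "qf": qf,             # query fields with boosts
        "mm": mm,             # minimum "should" match
    }
    if debug:
        params["debugQuery"] = "true"  # ask for the parsed query in the response
    return SOLR_SELECT + "?" + urlencode(params)


url = build_dismax_url('+"apache solr" search server', "features^2 name^3", "50%")
print(url)
```

The same parameters could instead be set as handler defaults in solrconfig.xml, in which case the URL only needs to carry q.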
The magic of dismax (in my opinion) comes from the query structure it produces. What it essentially boils down to is matrix multiplication: a one column matrix of each "chunk" of your user's input, multiplied by a one row matrix of the qf fields to produce a big matrix of every field:chunk permutation.
The matrix is then turned into a BooleanQuery consisting of DisjunctionMaxQueries for each row in the matrix. DisjunctionMaxQuery is used because its score is determined by the maximum score of its subclauses (instead of the sum, like a BooleanQuery), so no one word from the user input dominates the final score. The best way to explain this is with an example, so let's consider the following input…
defType = dismax
mm = 50%
qf = features^2 name^3
q = +"apache solr" search server
First off, we consider the "markup" characters of the parser that appear in this q string:
whitespace: divides the input string into chunks (tokenization)
quotes: make a single phrase chunk (grouping)
+: makes a chunk mandatory (combination rule)
So we have 3 "chunks" of user input:
"apache solr" (must match)
"search" (should match)
"server" (should match)
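The chunking just described can be sketched in a few lines. This is a simplified illustration that handles only the three markup characters listed here (whitespace, quotes, and a leading +); it is not the actual DisMax parser, and the dictionary keys are names chosen for this example.

```python
import re

# One chunk: an optional "+" prefix, then either a quoted phrase or a bare word.
CHUNK_RE = re.compile(r'(\+?)(?:"([^"]*)"|(\S+))')


def split_chunks(q):
    """Split a user query into chunks, marking phrases and mandatory chunks."""
    chunks = []
    for prefix, phrase, word in CHUNK_RE.findall(q):
        chunks.append({
            "text": phrase if phrase else word,
            "is_phrase": bool(phrase),
            "occur": "must" if prefix == "+" else "should",
        })
    return chunks


for chunk in split_chunks('+"apache solr" search server'):
    print(chunk)
```

Running it on the example input yields one mandatory phrase chunk and two optional single-word chunks.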
If we "multiply" that with our qf list (features, name) we get a matrix like this…
features:"apache solr" | name:"apache solr" | (must match)
features:"search" | name:"search" | (should match)
features:"server" | name:"server" | (should match)
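The "multiplication" of chunks by qf fields can be written out as a small sketch. The function names are made up for this example, and the cell format (field:"chunk"^boost) is only a readable rendering, not what Solr produces internally.

```python
def parse_qf(qf):
    """Parse a qf string such as 'features^2 name^3' into (field, boost) pairs."""
    fields = []
    for item in qf.split():
        name, _, boost = item.partition("^")
        fields.append((name, float(boost) if boost else 1.0))
    return fields


def chunk_field_matrix(chunks, qf):
    """Return one row per chunk and one cell per qf field."""
    fields = parse_qf(qf)
    return [
        [f'{name}:"{chunk}"^{boost:g}' for name, boost in fields]
        for chunk in chunks
    ]


matrix = chunk_field_matrix(["apache solr", "search", "server"], "features^2 name^3")
for row in matrix:
    print(" | ".join(row))
```

Each row of the result corresponds to one DisjunctionMaxQuery in the final BooleanQuery.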
If we then factor in the mm param to determine the "minimum number of 'Should Match' clauses that (ahem) must match" (50% of 2 == 1) we get the following query structure (in pseudo-code)…
q = BooleanQuery(
  minNumberShouldMatch => 1,
  booleanClauses => ClauseList(
    MustMatch(DisjunctionMaxQuery(
      PhraseQuery("features","apache solr")^2,
      PhraseQuery("name","apache solr")^3)
    ),
    ShouldMatch(DisjunctionMaxQuery(
      TermQuery("features","search")^2,
      TermQuery("name","search")^3)
    ),
    ShouldMatch(DisjunctionMaxQuery(
      TermQuery("features","server")^2,
      TermQuery("name","server")^3))
  ));
Note: the boolean query is the most basic, atomic query; other advanced queries are combinations or wrappers of it, and DisMax is no exception. Judging from how the dismax query parser decomposes the input, dismax also breaks down into boolean queries, and its field boosts work the same way as ordinary field boosts. The difference is that dismax takes the maximum score as the final score, whereas ordinary independent multi-field boosting sums the scores.
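The mm arithmetic used for this structure (50% of 2 == 1) can be sketched as follows. This covers only plain positive values such as "50%" or "2", and it assumes the percentage result is rounded down; negative and conditional mm specifications are deliberately left out.

```python
def min_should_match(mm, should_clause_count):
    """Resolve a simple mm spec ('50%' or '2') to a number of clauses.

    Assumption: percentages are rounded down. Negative and conditional
    specs are not handled in this sketch.
    """
    if mm.endswith("%"):
        percent = int(mm[:-1])
        return (should_clause_count * percent) // 100
    # A plain integer can never require more clauses than exist.
    return min(int(mm), should_clause_count)


print(min_should_match("50%", 2))  # 1, as in the example above
```

Only the "should match" clauses are counted; the mandatory "apache solr" clause is required regardless of mm.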
With me so far, right?
Where people tend to get tripped up is in thinking about how Solr's per-field analysis configuration (in schema.xml) impacts all of this. Our example above was pretty straightforward, but let's consider for a moment what might happen if:
The name field uses the WordDelimiterFilter at query time but features does not.
The features field is configured so that "the" is a stop word, but name is not.
Now let's look at what we get when our input parameters are structurally similar to what we had before, but just different enough for WordDelimiterFilter and StopFilter to come into play…
defType = dismax
mm = 50%
qf = features^2 name^3
q = +"apache solr" the search-server
Our resulting query is going to be something like…
q = BooleanQuery(
  minNumberShouldMatch => 1,
  booleanClauses => ClauseList(
    MustMatch(DisjunctionMaxQuery(
      PhraseQuery("features","apache solr")^2,
      PhraseQuery("name","apache solr")^3)
    ),
    ShouldMatch(DisjunctionMaxQuery(
      TermQuery("name","the")^3)
    ),
    ShouldMatch(DisjunctionMaxQuery(
      TermQuery("features","search-server")^2,
      PhraseQuery("name","search server")^3))
  ));
The use of WordDelimiterFilter hasn't changed things very much: features is treating "search-server" as a single Term, while in the name field we are searching for the phrase "search server". Hopefully this shouldn't surprise anyone given the use of WordDelimiterFilter for the name field (presumably that's why it's being used). This DisjunctionMaxQuery still "makes sense", but other fields with odd analysis that produce fewer or more Tokens than a "typical" field for the same chunk might produce queries that aren't as easy to understand. In particular consider what has happened in our example with the word "the": Because "the" is a stop word in the features field, no Query object is produced for that field/chunk combination. But a Query is produced for the name field, which means the total number of "Should Match" clauses in our top level query is still 2, so our minNumberShouldMatch is still 1 (50% of 2 == 1).
This type of situation tends to confuse a lot of people: since "the" is a stop word in one field, they don't expect it to matter in the final query. But as long as at least one qf field produces a Token for it (name in our example) it will be included in the final query, and will contribute to the count of "Should Match" clauses.
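The effect of per-field analysis on the clause count can be imitated with toy analyzers. Both analyzer functions below are hypothetical stand-ins (a stop word check for features, a hyphen split for name) and are far simpler than Solr's StopFilter and WordDelimiterFilter.

```python
def analyze_features(chunk):
    """Hypothetical 'features' analysis: drops the stop word 'the'."""
    return [] if chunk.lower() == "the" else [chunk]


def analyze_name(chunk):
    """Hypothetical 'name' analysis: splits on '-' like a word delimiter."""
    return [part for part in chunk.split("-") if part]


ANALYZERS = {"features": analyze_features, "name": analyze_name}


def build_should_clauses(chunks):
    """Build one clause per chunk, keeping only fields that produced tokens."""
    clauses = []
    for chunk in chunks:
        per_field = {}
        for field, analyze in ANALYZERS.items():
            tokens = analyze(chunk)
            if tokens:
                per_field[field] = " ".join(tokens)
        if per_field:  # a clause exists if at least one field produced tokens
            clauses.append(per_field)
    return clauses


clauses = build_should_clauses(["the", "search-server"])
for clause in clauses:
    print(clause)
print(len(clauses))  # 2
```

The chunk "the" still yields a clause because name produces a token for it, so the count of "Should Match" clauses stays at 2.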
So, what's the takeaway from all of this?
DisMax is a complicated creature. When using it, you need to consider all of its options carefully, and look at the debugQuery=true output while experimenting with different query strings and different analysis configurations to make really sure you understand how queries from your users will be parsed.
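Note: dismax builds very complex queries. When using it, consider all of the options carefully, and turn on debugQuery=true to check the parsed query for different query strings and analyzers.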
For qf (Query Fields), pf (Phrase Fields), mm (Minimum 'Should' Match), and tie (Tie Breaker), see: the Solr Wiki DisMaxQParserPlugin.
Solr: Forcing items with all query terms to the top of a Solr search, Robot Librarian
http://robotlibrarian.billdueber.com/solr-forcing-items-with-all-query-terms-to-the-top-of-a-solr-search/
Lucid Imagination: Solr Powered ISFDB - Part #10: Tweaking Relevancy
http://searchhub.org/dev/2011/06/20/solr-powered-isfdb-part-10/
Lucid Imagination: Solr Powered ISFDB - Part #11: Using DisMax
http://searchhub.org/dev/2011/08/08/solr-powered-isfdb-part-11/
http://tm.durusau.net/?p=21573
Using Solr's Dismax Tie Parameter, Another Word For It (tie breaker)
http://java.dzone.com/articles/using-solrs-dismax-tie