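A SpanQuery is a query object that takes the position information of Terms into account during the search.
The most basic SpanQuery is SpanTermQuery. It holds a single Term and, unlike TermQuery, provides a function that returns position information: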

    public Spans getSpans(final IndexReader reader) throws IOException {
      return new TermSpans(reader.termPositions(term), term);
    }

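Spans provides methods such as next(), skipTo(int target), doc(), start() and end(), which iterate over the matches and report the document number and the start and end positions of the current match.
The nextDoc function of SpanScorer is as follows: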

    public int nextDoc() throws IOException {
      if (!setFreqCurrentDoc()) {
        doc = NO_MORE_DOCS;
      }
      return doc;
    }

    protected boolean setFreqCurrentDoc() throws IOException {
      if (!more) {
        return false;
      }
      doc = spans.doc();
      freq = 0.0f;
      do {
        // freq is computed from the start and end positions and thus affects the score
        int matchLength = spans.end() - spans.start();
        freq += getSimilarity().sloppyFreq(matchLength);
        more = spans.next();
      } while (more && (doc == spans.doc()));
      return true;
    }

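To make the freq accumulation concrete, here is a minimal standalone sketch. It does not use Lucene. The class name `SloppyFreqSketch` and the `{start, end}` array representation are hypothetical, and `sloppyFreq` is assumed to behave like DefaultSimilarity, that is 1 / (distance + 1).

```java
import java.util.Arrays;
import java.util.List;

final class SloppyFreqSketch {
    // Assumption: mirrors DefaultSimilarity.sloppyFreq, i.e. 1 / (distance + 1).
    static float sloppyFreq(int distance) {
        return 1.0f / (distance + 1);
    }

    // Each int[] is a hypothetical {start, end} span inside one document.
    // The loop plays the role of the do/while in setFreqCurrentDoc.
    static float freqForDoc(List<int[]> spansInDoc) {
        float freq = 0.0f;
        for (int[] span : spansInDoc) {
            int matchLength = span[1] - span[0];
            freq += sloppyFreq(matchLength);
        }
        return freq;
    }

    public static void main(String[] args) {
        // One span of length 1 and one of length 3: 1/2 + 1/4.
        List<int[]> spans = Arrays.asList(new int[] {0, 1}, new int[] {4, 7});
        System.out.println(freqForDoc(spans));
    }
}
```

Under this assumption, longer matches contribute less to freq than tight ones.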
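SpanFirstQuery matches only documents in which the query term occurs near the beginning. As the code below shows, it has two member variables: match, the wrapped SpanQuery, and end, the upper bound on the position.
Its getSpans function is as follows: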

    public Spans getSpans(final IndexReader reader) throws IOException {
      return new Spans() {
        private Spans spans = match.getSpans(reader);

        @Override
        public boolean next() throws IOException {
          while (spans.next()) {
            // Only matches that end at or before the configured end are returned.
            if (end() <= end)
              return true;
          }
          return false;
        }

        @Override
        public boolean skipTo(int target) throws IOException {
          if (!spans.skipTo(target))
            return false;
          return spans.end() <= end || next();
        }

        @Override
        public int doc() { return spans.doc(); }

        @Override
        public int start() { return spans.start(); }

        @Override
        public int end() { return spans.end(); }
      };
    }

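The filtering rule of SpanFirstQuery can be sketched without Lucene as follows. The class `SpanFirstSketch` and the `{doc, start, end}` arrays are hypothetical stand-ins for a Spans enumeration, not Lucene types.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SpanFirstSketch {
    // Keeps only the {doc, start, end} spans that finish at or before the bound `end`,
    // which is the same test as `end() <= end` in the next() function.
    static List<int[]> firstSpans(List<int[]> spans, int end) {
        List<int[]> kept = new ArrayList<>();
        for (int[] s : spans) {
            if (s[2] <= end) {
                kept.add(s);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<int[]> spans = Arrays.asList(
            new int[] {0, 0, 1},   // doc 0, first position: kept for end = 3
            new int[] {0, 5, 6},   // doc 0, too far into the document: dropped
            new int[] {1, 2, 3});  // doc 1, ends exactly at the bound: kept
        System.out.println(firstSpans(spans, 3).size());
    }
}
```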
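SpanNearQuery holds the following member variables: the list of sub-queries clauses, the allowed distance slop, the flags inOrder and collectPayloads, and the field shared by all clauses.
Its getSpans function is as follows: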

    public Spans getSpans(final IndexReader reader) throws IOException {
      if (clauses.size() == 0)
        return new SpanOrQuery(getClauses()).getSpans(reader);
      if (clauses.size() == 1)
        return clauses.get(0).getSpans(reader);
      return inOrder
          ? (Spans) new NearSpansOrdered(this, reader, collectPayloads)
          : (Spans) new NearSpansUnordered(this, reader);
    }

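The effect of inOrder is best shown by an example:
Suppose the document "apple boy cat" has been indexed and the clauses of a SpanNearQuery are set to "apple", "cat", "boy", in that order. With inOrder=true the document is never found, however large the slop. With inOrder=false the document is found, even with slop set to 0.
The reason lies in the next function of NearSpansOrdered: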

    public boolean next() throws IOException {
      if (firstTime) {
        firstTime = false;
        for (int i = 0; i < subSpans.length; i++) {
          // Every sub-span is advanced to its first document.
          if (! subSpans[i].next()) {
            more = false;
            return false;
          }
        }
        more = true;
      }
      if (collectPayloads) {
        matchPayload.clear();
      }
      return advanceAfterOrdered();
    }

    private boolean advanceAfterOrdered() throws IOException {
      // Loop while all sub-spans point to the same document.
      while (more && (inSameDoc || toSameDoc())) {
        // stretchToOrder makes sure the sub-spans appear in the required order;
        // shrinkToAfterShortestMatch makes sure the distance between them does not exceed slop.
        if (stretchToOrder() && shrinkToAfterShortestMatch()) {
          return true;
        }
      }
      return false;
    }


    private boolean stretchToOrder() throws IOException {
      matchDoc = subSpans[0].doc();
      for (int i = 1; inSameDoc && (i < subSpans.length); i++) {
        // docSpansOrdered requires sub-span i-1 to come before sub-span i;
        // otherwise sub-span i is advanced until it does or it leaves the document.
        while (! docSpansOrdered(subSpans[i-1], subSpans[i])) {
          if (! subSpans[i].next()) {
            inSameDoc = false;
            more = false;
            break;
          } else if (matchDoc != subSpans[i].doc()) {
            inSameDoc = false;
            break;
          }
        }
      }
      return inSameDoc;
    }

    static final boolean docSpansOrdered(Spans spans1, Spans spans2) {
      assert spans1.doc() == spans2.doc() : "doc1 " + spans1.doc() + " != doc2 " + spans2.doc();
      int start1 = spans1.start();
      int start2 = spans2.start();
      return (start1 == start2) ? (spans1.end() < spans2.end()) : (start1 < start2);
    }


    private boolean shrinkToAfterShortestMatch() throws IOException {
      // Start from the last sub-span.
      matchStart = subSpans[subSpans.length - 1].start();
      matchEnd = subSpans[subSpans.length - 1].end();
      int matchSlop = 0;
      int lastStart = matchStart;
      int lastEnd = matchEnd;
      for (int i = subSpans.length - 2; i >= 0; i--) {
        // Walk backwards, taking the previous sub-span each time.
        Spans prevSpans = subSpans[i];
        int prevStart = prevSpans.start();
        int prevEnd = prevSpans.end();
        while (true) {
          if (! prevSpans.next()) {
            inSameDoc = false;
            more = false;
            break;
          } else if (matchDoc != prevSpans.doc()) {
            inSameDoc = false;
            break;
          } else {
            int ppStart = prevSpans.start();
            int ppEnd = prevSpans.end();
            if (! docSpansOrdered(ppStart, ppEnd, lastStart, lastEnd)) {
              break;
            } else {
              prevStart = ppStart;
              prevEnd = ppEnd;
            }
          }
        }
        assert prevStart <= matchStart;
        if (matchStart > prevEnd) {
          // The slop is always the start of the following sub-span minus the end of the
          // previous one. So in the example above, with the clauses set to "apple", "boy",
          // "cat", inOrder=true and slop=0, the document is found.
          matchSlop += (matchStart - prevEnd);
        }
        matchStart = prevStart;
        lastStart = prevStart;
        lastEnd = prevEnd;
      }
      boolean match = matchSlop <= allowedSlop;
      return match;
    }

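The ordering test and the slop sum can be tried out on the "apple boy cat" example with the following simplified sketch. It is not the Lucene implementation: the class `OrderedNearSketch` is hypothetical, each clause is given as one fixed `{start, end}` pair, and the advancing of sub-spans done by stretchToOrder and shrinkToAfterShortestMatch is left out.

```java
final class OrderedNearSketch {
    // Same rule as docSpansOrdered: the earlier start comes first; on a tie, the earlier end.
    static boolean ordered(int start1, int end1, int start2, int end2) {
        return (start1 == start2) ? (end1 < end2) : (start1 < start2);
    }

    // spans[i] = {start, end} of the i-th clause within one document.
    // Returns the accumulated slop, or -1 when the clauses are not in order.
    static int matchSlop(int[][] spans) {
        int slop = 0;
        for (int i = 1; i < spans.length; i++) {
            int[] prev = spans[i - 1];
            int[] cur = spans[i];
            if (!ordered(prev[0], prev[1], cur[0], cur[1])) {
                return -1;
            }
            if (cur[0] > prev[1]) {
                slop += cur[0] - prev[1]; // start of the following span minus end of the previous one
            }
        }
        return slop;
    }

    public static void main(String[] args) {
        int[] apple = {0, 1}, boy = {1, 2}, cat = {2, 3};
        System.out.println(matchSlop(new int[][] {apple, boy, cat})); // in order, adjacent
        System.out.println(matchSlop(new int[][] {apple, cat, boy})); // out of order
    }
}
```

With the clauses "apple", "boy", "cat" the sketch yields a slop of 0, and with "apple", "cat", "boy" it reports that the order is violated, which agrees with the example in the text.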
The next function of NearSpansUnordered is as follows:

    public boolean next() throws IOException {
      if (firstTime) {
        // Wrap each Spans in a SpansCell and put it both into the linked list and into the
        // priority queue. The queue is ordered by ascending document number and, for equal
        // document numbers, by position.
        initList(true);
        listToQueue();
        firstTime = false;
      } else if (more) {
        if (min().next()) {
          // Advance the top cell to its next match and adjust the queue.
          queue.updateTop();
        } else {
          more = false;
        }
      }
      while (more) {
        boolean queueStale = false;
        if (min().doc() != max.doc()) {
          // If the smallest and largest document numbers in the queue differ,
          // turn the queue into the list.
          queueToList();
          queueStale = true;
        }
        // Keep skipping the sub-spans until the smallest and largest document numbers are
        // equal; after that only the positions within the document differ.
        while (more && first.doc() < last.doc()) {
          more = first.skipTo(last.doc());
          firstToLast();
          queueStale = true;
        }
        if (!more) return false;
        // After the adjustment, write the list back into the queue.
        if (queueStale) {
          listToQueue();
          queueStale = false;
        }
        // Check for a match.
        if (atMatch()) {
          return true;
        }
        more = min().next();
        if (more) {
          queue.updateTop();
        }
      }
      return false;
    }

    private boolean atMatch() {
      // A match has two conditions: the smallest and largest document numbers are equal, and
      // the largest end position minus the smallest start position minus totalLength (the
      // summed lengths of the sub-spans) is at most slop.
      // In the example above, with the clauses set to "cat", "apple" and inOrder=false, slop=1
      // is enough: "cat".end = 3, "apple".start = 0, and totalLength =
      // ("cat".end - "cat".start) + ("apple".end - "apple".start) = 2, so 3 - 0 - 2 = 1.
      return (min().doc() == max.doc()) && ((max.end() - min().start() - totalLength) <= slop);
    }

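The arithmetic of atMatch can be checked with a small standalone sketch. The class `UnorderedNearSketch` is hypothetical, and each clause is reduced to a single `{start, end}` pair in the same document, so the queue and list handling of next is not modelled.

```java
final class UnorderedNearSketch {
    // spans[i] = {start, end} of one clause; all spans are assumed to be in the same document.
    // Returns max.end - min.start - totalLength, i.e. the smallest slop that still matches.
    static int requiredSlop(int[][] spans) {
        int minStart = Integer.MAX_VALUE;
        int maxEnd = Integer.MIN_VALUE;
        int totalLength = 0;
        for (int[] s : spans) {
            minStart = Math.min(minStart, s[0]);
            maxEnd = Math.max(maxEnd, s[1]);
            totalLength += s[1] - s[0];
        }
        return maxEnd - minStart - totalLength;
    }

    static boolean atMatch(int[][] spans, int slop) {
        return requiredSlop(spans) <= slop;
    }

    public static void main(String[] args) {
        int[] apple = {0, 1}, boy = {1, 2}, cat = {2, 3};
        System.out.println(requiredSlop(new int[][] {cat, apple}));      // "cat", "apple"
        System.out.println(requiredSlop(new int[][] {apple, cat, boy})); // all three terms
    }
}
```

For "cat", "apple" the required slop is 1, and for "apple", "cat", "boy" it is 0, matching the two examples given in the text.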
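SpanNotQuery contains two member variables, the SpanQuery include and the SpanQuery exclude.
Its next function takes matches from include and filters out every match that overlaps a match of exclude in the same document.
Its getSpans function is as follows: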

    public Spans getSpans(final IndexReader reader) throws IOException {
      return new Spans() {
        private Spans includeSpans = include.getSpans(reader);
        private boolean moreInclude = true;
        private Spans excludeSpans = exclude.getSpans(reader);
        private boolean moreExclude = excludeSpans.next();

        @Override
        public boolean next() throws IOException {
          // Advance include to its next match.
          if (moreInclude)
            moreInclude = includeSpans.next();
          // This loop checks whether the current include match is excluded;
          // if so, it moves on to the next include match.
          while (moreInclude && moreExclude) {
            // Skip exclude forward to the document of include.
            if (includeSpans.doc() > excludeSpans.doc())
              moreExclude = excludeSpans.skipTo(includeSpans.doc());
            // Within the same document, skip the exclude matches that end
            // before the include match starts.
            while (moreExclude
                && includeSpans.doc() == excludeSpans.doc()
                && excludeSpans.end() <= includeSpans.start()) {
              moreExclude = excludeSpans.next();
            }
            // If exclude is exhausted, or the documents differ, or the include match ends
            // before the exclude match starts, the include match is not excluded.
            if (!moreExclude
                || includeSpans.doc() != excludeSpans.doc()
                || includeSpans.end() <= excludeSpans.start())
              break;
            // Otherwise the include match overlaps an exclude match: take the next include match.
            moreInclude = includeSpans.next();
          }
          return moreInclude;
        }

        @Override
        public int doc() { return includeSpans.doc(); }

        @Override
        public int start() { return includeSpans.start(); }

        @Override
        public int end() { return includeSpans.end(); }
      };
    }

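The overlap rule applied by SpanNotQuery can be illustrated with the following sketch. It is a brute-force comparison rather than the merge-style iteration of the real code. The class `SpanNotSketch` and the `{doc, start, end}` arrays are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SpanNotSketch {
    // Two {doc, start, end} spans overlap when they are in the same document
    // and neither one ends at or before the start of the other.
    static boolean overlaps(int[] inc, int[] exc) {
        return inc[0] == exc[0] && exc[2] > inc[1] && inc[2] > exc[1];
    }

    // Keeps the include spans that do not overlap any exclude span.
    static List<int[]> spanNot(List<int[]> include, List<int[]> exclude) {
        List<int[]> kept = new ArrayList<>();
        for (int[] inc : include) {
            boolean excluded = false;
            for (int[] exc : exclude) {
                if (overlaps(inc, exc)) {
                    excluded = true;
                    break;
                }
            }
            if (!excluded) {
                kept.add(inc);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<int[]> include = Arrays.asList(
            new int[] {0, 0, 1}, new int[] {0, 3, 4}, new int[] {1, 0, 1});
        List<int[]> exclude = Arrays.asList(
            new int[] {0, 3, 5}, new int[] {2, 0, 1});
        // Only the span {0, 3, 4} overlaps an exclude span ({0, 3, 5}).
        System.out.println(spanNot(include, exclude).size());
    }
}
```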
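SpanOrQuery holds a list of sub SpanQuery objects and combines them with OR. It serves queries such as "documents in which apple and boy are near each other, or cat and dog are near each other".
Its OR merging algorithm is similar to DisjunctionSumScorer, the algorithm BooleanQuery uses for OR.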

    public boolean next() throws IOException {
      if (queue == null) {
        return initSpanQueue(-1);
      }
      if (queue.size() == 0) {
        return false;
      }
      // Advance the top of the priority queue to its next document or next position
      // and reorder the queue.
      if (top().next()) {
        queue.updateTop();
        return true;
      }
      // If the top Spans has no next document or next position, pop it.
      queue.pop();
      return queue.size() != 0;
    }

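The queue-based merge can be sketched with java.util.PriorityQueue in place of Lucene's own queue class. This is an illustration under assumptions, not the SpanOrQuery code: the class `SpanOrSketch` is hypothetical, each sub-query is a pre-sorted list of `{doc, start, end}` spans, and the ordering used is document, then start, then end.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class SpanOrSketch {
    // Merges several sorted lists of {doc, start, end} spans into one sorted stream.
    static List<int[]> merge(List<List<int[]>> subSpans) {
        Comparator<int[]> bySpan = (a, b) ->
            a[0] != b[0] ? Integer.compare(a[0], b[0])
                : a[1] != b[1] ? Integer.compare(a[1], b[1])
                : Integer.compare(a[2], b[2]);
        // Queue entries are cursors {listIndex, offset}, ordered by the span they point to.
        PriorityQueue<int[]> queue = new PriorityQueue<>((x, y) ->
            bySpan.compare(subSpans.get(x[0]).get(x[1]), subSpans.get(y[0]).get(y[1])));
        for (int i = 0; i < subSpans.size(); i++) {
            if (!subSpans.get(i).isEmpty()) {
                queue.add(new int[] {i, 0});
            }
        }
        List<int[]> out = new ArrayList<>();
        while (!queue.isEmpty()) {
            int[] top = queue.poll();
            List<int[]> list = subSpans.get(top[0]);
            out.add(list.get(top[1]));
            // Re-insert the cursor if its list has more spans; otherwise it is dropped,
            // which corresponds to queue.pop() in the next() function.
            if (top[1] + 1 < list.size()) {
                queue.add(new int[] {top[0], top[1] + 1});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<int[]> a = Arrays.asList(new int[] {0, 1, 2}, new int[] {2, 0, 1});
        List<int[]> b = Arrays.asList(new int[] {0, 0, 1}, new int[] {1, 4, 5});
        for (int[] s : merge(Arrays.asList(a, b))) {
            System.out.println(Arrays.toString(s));
        }
    }
}
```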
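SpanNearQuery compares positions, so the Terms whose positions are compared must be in the same field. Otherwise an IllegalArgumentException("Clauses must have same field.") is thrown.
Sometimes, however, we need to compare positions across different fields, for example:
Document 1:

    teacherid: 1
    studentfirstname: james
    studentsurname: jones

We index it as follows: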

    Document doc = new Document();
    doc.add(new Field("teacherid", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.add(new Field("studentfirstname", "james", Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.add(new Field("studentsurname", "jones", Field.Store.YES, Field.Index.NOT_ANALYZED));
    writer.addDocument(doc);

Document 2:

    teacherid: 2
    studentfirstname: james
    studentsurname: smith
    studentfirstname: sally
    studentsurname: jones

We index it as follows:

    doc = new Document();
    doc.add(new Field("teacherid", "2", Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.add(new Field("studentfirstname", "james", Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.add(new Field("studentsurname", "smith", Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.add(new Field("studentfirstname", "sally", Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.add(new Field("studentsurname", "jones", Field.Store.YES, Field.Index.NOT_ANALYZED));
    writer.addDocument(doc);

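Now we want to find the teacher of the student whose first name is james and whose surname is jones. Searching for "studentfirstname: james AND studentsurname: jones" clearly returns both teachers. One way to tell whether james and jones belong to the same student is the position information: when james and jones occupy the same position in their two fields, they belong to the same student.
For this we declare two SpanTermQuery objects: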

    SpanQuery q1 = new SpanTermQuery(new Term("studentfirstname", "james"));
    SpanQuery q2 = new SpanTermQuery(new Term("studentsurname", "jones"));

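Then we build a SpanNearQuery with q1 and q2 as its clauses. Because the two terms sit at the same position, we set inOrder=false and slop to -1, since "jones".end - "james".start - totalLength = 1 - 0 - 2 = -1. With these settings the document can be found.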
However, the constructor of SpanNearQuery looks like this:

    public SpanNearQuery(SpanQuery[] clauses, int slop, boolean inOrder, boolean collectPayloads) {
      this.clauses = new ArrayList<SpanQuery>(clauses.length);
      for (int i = 0; i < clauses.length; i++) {
        SpanQuery clause = clauses[i];
        if (i == 0) {
          field = clause.getField();
        } else if (!clause.getField().equals(field)) {
          // All clauses are required to belong to the same field.
          throw new IllegalArgumentException("Clauses must have same field.");
        }
        this.clauses.add(clause);
      }
      this.collectPayloads = collectPayloads;
      this.slop = slop;
      this.inOrder = inOrder;
    }

So we bring in FieldMaskingSpanQuery:

    SpanQuery q2m = new FieldMaskingSpanQuery(q2, "studentfirstname");

FieldMaskingSpanQuery.getField() returns the masked field name you specify, "studentfirstname", so the check passes and the positions can be compared.
Our query process is as follows:

    File indexDir = new File("TestFieldMaskingSpanQuery/index");
    IndexReader reader = IndexReader.open(FSDirectory.open(indexDir));
    IndexSearcher searcher = new IndexSearcher(reader);
    SpanQuery q1 = new SpanTermQuery(new Term("studentfirstname", "james"));
    SpanQuery q2 = new SpanTermQuery(new Term("studentsurname", "jones"));
    SpanQuery q2m = new FieldMaskingSpanQuery(q2, "studentfirstname");
    Query query = new SpanNearQuery(new SpanQuery[]{q1, q2m}, -1, false);
    TopDocs docs = searcher.search(query, 50);
    for (ScoreDoc doc : docs.scoreDocs) {
      System.out.println("docid : " + doc.doc + " score : " + doc.score);
    }

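As a hand check of the slop value, the unordered match condition max.end - min.start - totalLength <= slop can be evaluated for both documents. This is a sketch under two assumptions: each NOT_ANALYZED value occupies exactly one position, and the second value of a multi-valued field lands at position 1 (no extra position increment gap between values).

```latex
\begin{align*}
\text{Document 1:}\quad & 1 - 0 - 2 = -1 \le -1
  && \text{james at } [0,1),\ \text{jones at } [0,1):\ \text{match} \\
\text{Document 2:}\quad & 2 - 0 - 2 = 0 > -1
  && \text{james at } [0,1),\ \text{jones at } [1,2):\ \text{no match}
\end{align*}
```

Under these assumptions only document 1, that is teacher 1, would be expected in the result.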
Query objects with the Payload prefix do not change the result set because of the payloads; they only change the scores.
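To use the Payload family of queries, the scoring relies on two pieces that both appear in the code below: Similarity.scorePayload and a PayloadFunction.
A PayloadFunction has to implement two methods, currentScore and docScore.
PayloadFunction has three implementations: AveragePayloadFunction, MaxPayloadFunction and MinPayloadFunction.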
For PayloadTermQuery, the PayloadTermSpanScorer it creates contains:

    payloadScore = function.currentScore(doc, term.field(),
        spans.start(), spans.end(), payloadsSeen, payloadScore,
        similarity.scorePayload(doc, term.field(), spans.start(), spans.end(),
            payload, 0, positions.getPayloadLength()));

    protected float getPayloadScore() {
      return function.docScore(doc, term.field(), payloadsSeen, payloadScore);
    }

For PayloadNearQuery, the PayloadNearSpanScorer it creates contains:

    payloadScore = function.currentScore(doc, fieldName, start, end,
        payloadsSeen, payloadScore,
        similarity.scorePayload(doc, fieldName, spans.start(), spans.end(),
            thePayload, 0, thePayload.length));

    public float score() throws IOException {
      return super.score() * function.docScore(doc, fieldName, payloadsSeen, payloadScore);
    }

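How currentScore and docScore cooperate can be sketched with a simplified, standalone interface. `SimplePayloadFunction` and `PayloadFunctionSketch` are hypothetical names: the doc, field and position arguments of the real PayloadFunction are left out, and the averaging behaviour shown is an assumption about how an average-style function could be written, not a copy of Lucene's class.

```java
final class PayloadFunctionSketch {
    // Hypothetical, reduced stand-in for PayloadFunction.
    interface SimplePayloadFunction {
        // Folds the score of the current payload into the running value.
        float currentScore(int payloadsSeen, float scoreSoFar, float currentPayloadScore);

        // Turns the running value into the final payload factor of the document.
        float docScore(int payloadsSeen, float payloadScore);
    }

    // Averages the payload scores of a document; 1 is returned when no payload was seen,
    // so that multiplying it into the score leaves the score unchanged.
    static final SimplePayloadFunction AVERAGE = new SimplePayloadFunction() {
        @Override
        public float currentScore(int payloadsSeen, float scoreSoFar, float currentPayloadScore) {
            return scoreSoFar + currentPayloadScore;
        }

        @Override
        public float docScore(int payloadsSeen, float payloadScore) {
            return payloadsSeen > 0 ? payloadScore / payloadsSeen : 1.0f;
        }
    };

    // Plays the role of the scorer: one currentScore call per payload, then docScore once.
    static float payloadFactor(SimplePayloadFunction function, float[] payloadScores) {
        float payloadScore = 0.0f;
        int payloadsSeen = 0;
        for (float p : payloadScores) {
            payloadScore = function.currentScore(payloadsSeen, payloadScore, p);
            payloadsSeen++;
        }
        return function.docScore(payloadsSeen, payloadScore);
    }

    public static void main(String[] args) {
        System.out.println(payloadFactor(AVERAGE, new float[] {2.0f, 4.0f}));
    }
}
```

In the scorers quoted above, the value returned by docScore is then combined with the ordinary span score, for PayloadNearQuery by multiplying it with super.score().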