PhraseQuery:短语查询,就是查询文档中是否包含指定的一个Term或多个Term,多个Term之间可以指定间隔即slop参数,官方API解释如图:
package com.yida.framework.lucene5.query; import java.io.IOException; import org.apache.lucene.analysis.Analyzer; import org.apache.lucene.analysis.standard.StandardAnalyzer; import org.apache.lucene.document.Document; import org.apache.lucene.document.Field; import org.apache.lucene.document.TextField; import org.apache.lucene.index.DirectoryReader; import org.apache.lucene.index.IndexReader; import org.apache.lucene.index.IndexWriter; import org.apache.lucene.index.IndexWriterConfig; import org.apache.lucene.index.IndexWriterConfig.OpenMode; import org.apache.lucene.index.Term; import org.apache.lucene.search.IndexSearcher; import org.apache.lucene.search.PhraseQuery; import org.apache.lucene.search.ScoreDoc; import org.apache.lucene.search.TopDocs; import org.apache.lucene.store.Directory; import org.apache.lucene.store.RAMDirectory; public class PhraseQueryTest { public static void main(String[] args) throws IOException { Directory dir = new RAMDirectory(); Analyzer analyzer = new StandardAnalyzer(); IndexWriterConfig iwc = new IndexWriterConfig(analyzer); iwc.setOpenMode(OpenMode.CREATE); IndexWriter writer = new IndexWriter(dir, iwc); Document doc = new Document(); doc.add(new TextField("text", "quick brown fox", Field.Store.YES)); writer.addDocument(doc); doc = new Document(); doc.add(new TextField("text", "jumps over lazy broun dog", Field.Store.YES)); writer.addDocument(doc); doc = new Document(); doc.add(new TextField("text", "jumps over extremely very lazy broxn dog", Field.Store.YES)); writer.addDocument(doc); writer.close(); IndexReader reader = DirectoryReader.open(dir); IndexSearcher searcher = new IndexSearcher(reader); String term1 = "dog"; String term2 = "jumps"; PhraseQuery phraseQuery = new PhraseQuery(); phraseQuery.add(new Term("text",term1)); phraseQuery.add(new Term("text",term2)); phraseQuery.setSlop(15); TopDocs results = searcher.search(phraseQuery, null, 100); ScoreDoc[] scoreDocs = results.scoreDocs; for (int i = 0; i < scoreDocs.length; ++i) { //System.out.println(searcher.explain(query, scoreDocs[i].doc)); int docID = scoreDocs[i].doc; Document document = searcher.doc(docID); String path = document.get("text"); System.out.println("text:" + path); } } }
pharseQuery.add(term),每次都是add到末尾,当然你也可以用add(term,position)明确指定add到哪个位置,示例代码中add了两个Term,则我们的查询短语是dog jumps,他们的间隔为0,然后我们设置slop值为5,
第2个索引文档里单词jumps往右移动5次刚好可以得到我们的查询短语dog jumps,因此它符合要求被返回了,而第1个索引文档直接不包含单词dog不符合要求,第3个索引文档需要移动7次才能得到dog jumps,所以最后返回的只有第2个索引文档。
如果我把代码变一下,改成这样:
String term1 = "dog"; String term2 = "jumps"; PhraseQuery phraseQuery = new PhraseQuery(); phraseQuery.add(new Term("text",term1),0); phraseQuery.add(new Term("text",term2),2); phraseQuery.setSlop(6); TopDocs results = searcher.search(phraseQuery, null, 100);
这时候我们的查询短语就是dog xxx jumps,意思就是我们要查询包含dog和jumps字符的文档而且dog和jumps之间要有一个字符间隔(不包含停用词),这时候我们的slop就要加1了,即我们需要再多移动一次,所以这次slop值应该为6.
PharseQuery下还有一个子类NGramPhraseQuery,这个子类涉及到N-Gram模型,算法之类的我就略过了。
如果你还有什么问题请加我Q-Q:7-3-6-0-3-1-3-0-5,
或者加裙
一起交流学习!