Lucene Operations Q&A
1. Indexing with the IK analyzer:
Lucene works with IK out of the box: create an IKAnalyzer instance and pass it to IndexWriterConfig.
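A minimal sketch of that wiring, assuming Lucene 4.x and an IK Analyzer build for Lucene 4 (class org.wltea.analyzer.lucene.IKAnalyzer) on the classpath. The class name IkIndexSketch, the field names, the sample text, and the in-memory RAMDirectory are illustrative assumptions, not part of the original notes.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.wltea.analyzer.lucene.IKAnalyzer;

class IkIndexSketch {
    /** Indexes one document with IKAnalyzer and returns the document count. */
    static int indexOne() throws Exception {
        // true selects IK's "smart" (coarse-grained) segmentation mode.
        Analyzer analyzer = new IKAnalyzer(true);
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);

        // Assumption: in-memory index for the demo; use FSDirectory.open(new File(path)) on disk.
        Directory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, config);
        try {
            Document doc = new Document();
            doc.add(new StringField("id", "1", Field.Store.YES));          // not analyzed
            doc.add(new TextField("content", "圣诞节快乐", Field.Store.YES)); // analyzed by IK
            writer.addDocument(doc);
            writer.commit();
            return writer.numDocs();
        } finally {
            writer.close();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("documents indexed: " + indexOne());
    }
}
```

The same analyzer instance should also be used at query time so that query text is segmented the same way as the indexed text.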
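2. Inspecting the index with Luke
Download the Luke release that matches your Lucene version, put it in the index data directory, and launch the jar to browse the index.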
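3. Deleting and modifying index entries
Since Lucene 4, deletion is done through IndexWriter (the delete methods on IndexReader were removed); see the API docs for details. Modification is generally done by deleting the old document and re-adding it, which is what IndexWriter.updateDocument does.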
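The delete and delete-then-re-add operations can be sketched as follows, assuming Lucene 4.x with the analyzers-common module. The class name DeleteUpdateSketch, the "id"/"content" fields, and the sample data are made up for illustration.

```java
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

class DeleteUpdateSketch {
    /** Builds a document with an exact-match "id" key and an analyzed "content" field. */
    static Document doc(String id, String content) {
        Document d = new Document();
        d.add(new StringField("id", id, Field.Store.YES));
        d.add(new TextField("content", content, Field.Store.YES));
        return d;
    }

    /** Adds two documents, deletes one, updates the other, and summarizes the result. */
    static String run() throws Exception {
        Directory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir,
                new IndexWriterConfig(Version.LUCENE_40, new WhitespaceAnalyzer(Version.LUCENE_40)));
        writer.addDocument(doc("1", "old text"));
        writer.addDocument(doc("2", "other text"));

        // Delete every document whose "id" term is "2".
        writer.deleteDocuments(new Term("id", "2"));
        // Update = delete documents matching the term, then add the new version.
        writer.updateDocument(new Term("id", "1"), doc("1", "new text"));
        writer.close(); // close() commits pending changes

        DirectoryReader reader = DirectoryReader.open(dir);
        try {
            IndexSearcher searcher = new IndexSearcher(reader);
            TopDocs top = searcher.search(new TermQuery(new Term("id", "1")), 1);
            String content = searcher.doc(top.scoreDocs[0].doc).get("content");
            return reader.numDocs() + " doc(s), id 1 -> " + content;
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run());
    }
}
```

Using a non-analyzed StringField as the key matters here: the Term passed to deleteDocuments and updateDocument must match the indexed term exactly.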
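4. Optimizing the index (limiting the number of index files generated)
Call indexWriter.forceMerge(int maxNumSegments), which merges the index down to at most maxNumSegments segments.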
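A small sketch of forceMerge in action, assuming Lucene 4.x. The class name ForceMergeSketch and the sample documents are hypothetical; committing after each document is only there to force several segments into existence for the demo.

```java
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

class ForceMergeSketch {
    /** Writes several small segments, merges them, and returns the final segment count. */
    static int segmentsAfterMerge() throws Exception {
        Directory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir,
                new IndexWriterConfig(Version.LUCENE_40, new WhitespaceAnalyzer(Version.LUCENE_40)));
        for (int i = 0; i < 3; i++) {
            Document doc = new Document();
            doc.add(new TextField("content", "doc " + i, Field.Store.NO));
            writer.addDocument(doc);
            writer.commit(); // each commit flushes the buffered document as its own segment
        }

        // Merge down to at most one segment; this call blocks until merging is done.
        writer.forceMerge(1);
        writer.close();

        DirectoryReader reader = DirectoryReader.open(dir);
        try {
            return reader.leaves().size(); // one leaf per segment
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("segments after forceMerge(1): " + segmentsAfterMerge());
    }
}
```

forceMerge rewrites the affected segments, so on a large index it is an expensive call and is usually reserved for indexes that no longer change.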
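5. Highlighting
5.1 Highlighting requires an additional jar (the highlighter module); its API documentation also lives in a separate package.
5.2 Look at the constructor:
Highlighter(Formatter formatter, Scorer fragmentScorer)
This shows that a Highlighter needs a Formatter and a Scorer. SimpleHTMLFormatter can serve as the Formatter, and the Scorer can be QueryScorer(Query query) (the example below uses QueryTermScorer instead).
An example: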
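import java.io.File;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.highlight.Formatter;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryTermScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.store.FSDirectory;
import org.wltea.analyzer.lucene.IKAnalyzer;

public class HighlightDemo {
    public static void highlightSearch(String indexPath) throws Exception {
        File indexFile = new File(indexPath);
        IndexReader ir = DirectoryReader.open(FSDirectory.open(indexFile));
        IndexSearcher is = new IndexSearcher(ir);
        IKAnalyzer ikAnalyzer = new IKAnalyzer(true);
        Query query = new TermQuery(new Term("content", "圣诞节"));

        // Highlighter setup: wrap matched terms in <B>...</B>
        Formatter formatter = new SimpleHTMLFormatter("<B>", "</B>");
        QueryTermScorer scorer = new QueryTermScorer(query);
        Highlighter highlighter = new Highlighter(formatter, scorer);

        // search() only returns basic info (doc ID, score) for the top n hits;
        // use is.doc() with the ID to load the full document.
        TopDocs topdocs = is.search(query, 10);
        ScoreDoc[] docs = topdocs.scoreDocs;
        int hits = topdocs.totalHits;
        System.out.println("total:" + hits);

        for (ScoreDoc doc : docs) {
            // Load the full document
            Document document = is.doc(doc.doc);
            String rawTitle = document.get("title");
            String rawContent = document.get("content");

            // Highlight matches; getBestFragment returns null when nothing matches,
            // so fall back to the raw stored value in that case.
            String title = rawTitle == null ? null
                    : highlighter.getBestFragment(ikAnalyzer, "title", rawTitle);
            String content = rawContent == null ? null
                    : highlighter.getBestFragment(ikAnalyzer, "content", rawContent);
            if (title == null) title = rawTitle;
            if (content == null) content = rawContent;

            System.out.println(document.get("id"));
            System.out.println(title);
            System.out.println(content);
        }
        ir.close();
    }
}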
6. The various Query types explained
6.1 TermQuery
A Query that matches documents containing a term. This may be combined with other terms with a BooleanQuery.
6.2 BooleanQuery
A Query that matches documents matching boolean combinations of other queries, e.g. TermQuerys, PhraseQuerys or other BooleanQuerys.
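As a rough illustration of combining TermQuery clauses in a BooleanQuery, here is a sketch that assumes the Lucene 4.x API (mutable BooleanQuery, int totalHits). The class name BooleanQuerySketch, the "content" field, and the three sample documents are assumptions made for the demo.

```java
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

class BooleanQuerySketch {
    /** Builds a throwaway in-memory index with one "content" field per document. */
    static Directory index(String... texts) throws Exception {
        Directory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir,
                new IndexWriterConfig(Version.LUCENE_40, new WhitespaceAnalyzer(Version.LUCENE_40)));
        for (String text : texts) {
            Document doc = new Document();
            doc.add(new TextField("content", text, Field.Store.YES));
            writer.addDocument(doc);
        }
        writer.close();
        return dir;
    }

    /** Returns the number of documents matching the query. */
    static int count(Directory dir, Query query) throws Exception {
        DirectoryReader reader = DirectoryReader.open(dir);
        try {
            return new IndexSearcher(reader).search(query, 10).totalHits;
        } finally {
            reader.close();
        }
    }

    /** Counts documents that contain "lucene" and, depending on the flag, do or do not contain "search". */
    static int luceneAndSearch(boolean searchRequired) throws Exception {
        Directory dir = index("lucene search", "lucene index", "solr search");
        BooleanQuery query = new BooleanQuery();
        query.add(new TermQuery(new Term("content", "lucene")), BooleanClause.Occur.MUST);
        query.add(new TermQuery(new Term("content", "search")),
                searchRequired ? BooleanClause.Occur.MUST : BooleanClause.Occur.MUST_NOT);
        return count(dir, query);
    }

    public static void main(String[] args) throws Exception {
        System.out.println("+lucene +search: " + luceneAndSearch(true));
        System.out.println("+lucene -search: " + luceneAndSearch(false));
    }
}
```

Occur.SHOULD is the third option: such clauses are optional but raise the score of documents that match them.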
6.3 WildcardQuery
public class WildcardQuery extends AutomatonQuery
Implements the wildcard search query. Supported wildcards are *, which matches any character sequence (including the empty one), and ?, which matches any single character. '\' is the escape character.
Note this query can be slow, as it needs to iterate over many terms. In order to prevent extremely slow WildcardQueries, a Wildcard term should not start with the wildcard *.
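The two wildcards can be tried out with a sketch like the following, assuming Lucene 4.x. The class name WildcardSketch and the three single-term sample documents ("lucene", "lucent", "luke") are illustrative.

```java
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.WildcardQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

class WildcardSketch {
    /** Indexes three one-word documents and counts those whose term matches the pattern. */
    static int hits(String pattern) throws Exception {
        Directory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir,
                new IndexWriterConfig(Version.LUCENE_40, new WhitespaceAnalyzer(Version.LUCENE_40)));
        for (String text : new String[] {"lucene", "lucent", "luke"}) {
            Document doc = new Document();
            doc.add(new TextField("content", text, Field.Store.YES));
            writer.addDocument(doc);
        }
        writer.close();

        DirectoryReader reader = DirectoryReader.open(dir);
        try {
            // The pattern is matched against whole indexed terms of the "content" field.
            WildcardQuery query = new WildcardQuery(new Term("content", pattern));
            return new IndexSearcher(reader).search(query, 10).totalHits;
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("luc*  -> " + hits("luc*"));  // "lucene" and "lucent"
        System.out.println("lu?e  -> " + hits("lu?e"));  // only "luke"
    }
}
```

Both patterns start with literal characters, which keeps the set of terms Lucene has to visit small, in line with the advice above about leading *.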
6.4 PhraseQuery
A Query that matches documents containing a particular sequence of terms. A PhraseQuery is built by QueryParser for input like "new york".
Unlike a TermQuery, a PhraseQuery is assembled from several terms with phraseQuery.add(Term term, int position), documented as:
public void add(Term term, int position)
Adds a term to the end of the query phrase. The relative position of the term within the phrase is specified explicitly. This allows e.g. phrases with more than one term at the same position or phrases with gaps (e.g. in connection with stopwords).
In other words, position controls the gap between the terms of the phrase.
setSlop(int slop) additionally sets how far terms may be moved from their given positions and still match.
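The effect of explicit positions and of slop can be sketched as below, assuming the Lucene 4.x API (PhraseQuery with a no-arg constructor, add and setSlop). The class name PhraseQuerySketch and the three sample documents are made up for the demo.

```java
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

class PhraseQuerySketch {
    /** Indexes the sample documents and counts matches for the given phrase query. */
    static int count(PhraseQuery query) throws Exception {
        Directory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir,
                new IndexWriterConfig(Version.LUCENE_40, new WhitespaceAnalyzer(Version.LUCENE_40)));
        for (String text : new String[] {"new york city", "new big york", "york new"}) {
            Document doc = new Document();
            doc.add(new TextField("content", text, Field.Store.YES));
            writer.addDocument(doc);
        }
        writer.close();
        DirectoryReader reader = DirectoryReader.open(dir);
        try {
            return new IndexSearcher(reader).search(query, 10).totalHits;
        } finally {
            reader.close();
        }
    }

    /** Phrase "new york" with adjacent positions and the given slop. */
    static int hitsWithSlop(int slop) throws Exception {
        PhraseQuery query = new PhraseQuery();
        query.add(new Term("content", "new"));
        query.add(new Term("content", "york"));
        query.setSlop(slop);
        return count(query);
    }

    /** Phrase "new _ york": explicit positions leave a one-term gap, slop stays 0. */
    static int hitsWithGap() throws Exception {
        PhraseQuery query = new PhraseQuery();
        query.add(new Term("content", "new"), 0);
        query.add(new Term("content", "york"), 2);
        return count(query);
    }

    public static void main(String[] args) throws Exception {
        System.out.println("slop 0: " + hitsWithSlop(0)); // "new york city"
        System.out.println("slop 1: " + hitsWithSlop(1)); // also "new big york"
        System.out.println("gap:    " + hitsWithGap());   // only "new big york"
    }
}
```

The reversed document "york new" is matched by neither slop 0 nor slop 1, since swapping two adjacent terms takes two position moves.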
6.5 MultiPhraseQuery
MultiPhraseQuery is a generalized version of PhraseQuery, with an added method add(Term[]). To use this class, to search for the phrase "Microsoft app*" first use add(Term) on the term "Microsoft", then find all terms that have "app" as prefix using IndexReader.terms(Term), and use MultiPhraseQuery.add(Term[] terms) to add them to the query.
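A sketch of the "Microsoft app*" case, assuming the Lucene 4.x API. To keep it short, the terms starting with "app" are hard-coded rather than enumerated from the index, which is a simplification of the step described above. The class name MultiPhraseSketch and the lower-cased sample documents are assumptions.

```java
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiPhraseQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

class MultiPhraseSketch {
    /** Counts documents containing "microsoft" immediately followed by "apple" or "application". */
    static int hits() throws Exception {
        Directory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir,
                new IndexWriterConfig(Version.LUCENE_40, new WhitespaceAnalyzer(Version.LUCENE_40)));
        String[] texts = {
            "microsoft apple", "microsoft application", "microsoft word", "apple microsoft"
        };
        for (String text : texts) {
            Document doc = new Document();
            doc.add(new TextField("content", text, Field.Store.YES));
            writer.addDocument(doc);
        }
        writer.close();

        MultiPhraseQuery query = new MultiPhraseQuery();
        // First position: exactly one term.
        query.add(new Term("content", "microsoft"));
        // Second position: any of several terms. Hard-coded here (assumption);
        // normally these would be the indexed terms that start with "app".
        query.add(new Term[] {
            new Term("content", "apple"),
            new Term("content", "application")
        });

        DirectoryReader reader = DirectoryReader.open(dir);
        try {
            return new IndexSearcher(reader).search(query, 10).totalHits;
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("microsoft app*: " + hits());
    }
}
```

"microsoft word" fails on the second position and "apple microsoft" fails on term order, so only the first two documents match.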
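6.6 FuzzyQuery
6.7 RegexpQuery
A regular-expression query; the expression is matched against indexed terms only, not against the original text.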
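The "terms only" behavior can be seen in a sketch like this one, assuming Lucene 4.x. The class name RegexpSketch and the two sample documents are illustrative; with a whitespace analyzer, "new york" is indexed as the two terms "new" and "york", while "newyork" is a single term.

```java
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.RegexpQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

class RegexpSketch {
    /** Counts documents that have at least one term fully matching the regular expression. */
    static int hits(String regexp) throws Exception {
        Directory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir,
                new IndexWriterConfig(Version.LUCENE_40, new WhitespaceAnalyzer(Version.LUCENE_40)));
        for (String text : new String[] {"new york", "newyork"}) {
            Document doc = new Document();
            doc.add(new TextField("content", text, Field.Store.YES));
            writer.addDocument(doc);
        }
        writer.close();

        DirectoryReader reader = DirectoryReader.open(dir);
        try {
            RegexpQuery query = new RegexpQuery(new Term("content", regexp));
            return new IndexSearcher(reader).search(query, 10).totalHits;
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws Exception {
        // No single term of "new york" matches, so only "newyork" is found.
        System.out.println("new.*york -> " + hits("new.*york"));
        // Matches the term "new" in one document and "newyork" in the other.
        System.out.println("new.*     -> " + hits("new.*"));
        // The whole term must match, so "newyork" is not a hit for "york".
        System.out.println("york      -> " + hits("york"));
    }
}
```

An expression that spans several words therefore never matches analyzed text; a PhraseQuery or MultiPhraseQuery is the tool for that.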