This post is written for Lucene beginners; veterans, please don't flame me.
I've just started learning Lucene, so I went straight for the new 7.1.0 release. I was still putting together a hello-world when a MoreLikeThis query started returning empty results:
no matter what I tried, no data came back, and Baidu had no answers, only material about old versions. So I had no choice but to read the source myself.
If anyone has demos for the new versions, please share them; figuring this out alone is exhausting. Thanks in advance.
Since I'm on the newest version and couldn't find a demo anywhere, all of my code comes from the sample code in the API docs. For instance, the MoreLikeThis javadoc shows this
example:
IndexReader ir = ...
IndexSearcher is = ...
MoreLikeThis mlt = new MoreLikeThis(ir);
Reader target = ... // orig source of doc you want to find similarities to
Query query = mlt.like( target);
Hits hits = is.search(query);
// now the usual iteration thru 'hits' - the only thing to watch for is to make sure
//you ignore the doc if it matches your 'target' document, as it should be similar to itself
Following that template (note that the javadoc sample still uses the long-removed Hits class, so it can't be used verbatim), my test code looks like this:
@Test
public void testMoreLikeThis() throws IOException {
    final Path path = Paths.get(INDEX_DIR);
    Directory directory = FSDirectory.open(path);
    Analyzer analyzer = new StandardAnalyzer();
    IndexReader indexReader = DirectoryReader.open(directory);
    IndexSearcher indexSearcher = new IndexSearcher(indexReader);
    MoreLikeThis mlt = new MoreLikeThis(indexReader);
    mlt.setAnalyzer(analyzer); // this must be set, otherwise like(String, Reader...) throws an exception
    Query query = mlt.like("content", new StringReader("your doc content to search"));
    TopDocs topDocs = indexSearcher.search(query, 10);
    long count = topDocs.totalHits;
    System.out.println("total hits: " + count);
    ScoreDoc[] scoreDocs = topDocs.scoreDocs;
    for (ScoreDoc scoreDoc : scoreDocs) {
        Document document = indexSearcher.doc(scoreDoc.doc);
        System.out.print("score: " + scoreDoc.score);
        System.out.println(document.get("content"));
    }
}
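By the way, the test assumes an index already exists at INDEX_DIR with an analyzed text field named "content". In case anyone wants to reproduce this, here is a minimal sketch of how such a toy index could be built; the sample strings and the buildDemoIndex name are my own inventions, the rest is the standard Lucene 7.x indexing API:

public void buildDemoIndex() throws IOException {
    try (Directory dir = FSDirectory.open(Paths.get(INDEX_DIR));
         IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
        for (String text : new String[] {"your doc content to search", "some other doc content"}) {
            Document doc = new Document();
            // TextField is analyzed at index time, so its tokens line up with what
            // StandardAnalyzer produces for the MoreLikeThis query text later
            doc.add(new TextField("content", text, Field.Store.YES));
            writer.addDocument(doc);
        }
    }
}

(IndexWriter and IndexWriterConfig come from org.apache.lucene.index; TextField and Field.Store from org.apache.lucene.document.)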
Running it, the query returns 0 records, even though the content I'm searching for is definitely in the index. Into the source code:
/**
 * Return a query that will return docs like the passed Readers.
 * This was added in order to treat multi-value fields.
 *
 * @return a query that will return docs like the passed Readers.
 */
public Query like(String fieldName, Reader... readers) throws IOException {
    Map<String, Map<String, Int>> perFieldTermFrequencies = new HashMap<>();
    for (Reader r : readers) {
        addTermFrequencies(r, perFieldTermFrequencies, fieldName); // builds the term frequencies of the query text
    }
    return createQuery(createQueue(perFieldTermFrequencies));
}
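Side note: the varargs Reader parameter is there for multi-valued fields, as the javadoc says; each Reader's tokens get merged into the same per-field frequency map. Continuing from my test above, querying with two values of the same field would look like this (the strings are placeholders):

Query query = mlt.like("content",
        new StringReader("first value of the field"),
        new StringReader("second value of the field"));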
Here's the source that adds the term frequencies:
/**
 * Adds term frequencies found by tokenizing text from reader into the Map words
 *
 * @param r a source of text to be tokenized
 * @param perFieldTermFrequencies a Map of terms and their frequencies per field
 * @param fieldName Used by analyzer for any special per-field analysis
 */
private void addTermFrequencies(Reader r, Map<String, Map<String, Int>> perFieldTermFrequencies, String fieldName)
        throws IOException {
    if (analyzer == null) { // this is why setAnalyzer() is required above -- the javadoc sample never sets it
        throw new UnsupportedOperationException("To use MoreLikeThis without " +
            "term vectors, you must provide an Analyzer");
    }
    Map<String, Int> termFreqMap = perFieldTermFrequencies.get(fieldName);
    if (termFreqMap == null) {
        termFreqMap = new HashMap<>();
        perFieldTermFrequencies.put(fieldName, termFreqMap);
    }
    try (TokenStream ts = analyzer.tokenStream(fieldName, r)) {
        int tokenCount = 0;
        // for every token
        CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            String word = termAtt.toString();
            tokenCount++;
            if (tokenCount > maxNumTokensParsed) {
                break;
            }
            if (isNoiseWord(word)) {
                continue;
            }
            // increment frequency
            Int cnt = termFreqMap.get(word);
            if (cnt == null) {
                termFreqMap.put(word, new Int());
            } else {
                cnt.x++;
            }
        }
        ts.end();
    }
}
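Incidentally, that exception message reveals the other way to use MoreLikeThis: if the field was indexed with term vectors, you can pass a document number instead of raw text and no analyzer is required at all. A sketch under that assumption (someDocId is a placeholder for a Lucene doc number you already have, e.g. from an earlier search):

// index time: store term vectors on the field
FieldType withVectors = new FieldType(TextField.TYPE_STORED);
withVectors.setStoreTermVectors(true);
doc.add(new Field("content", "your doc content to search", withVectors));

// query time: frequencies are read from the stored vectors, no setAnalyzer() needed
MoreLikeThis mlt = new MoreLikeThis(indexReader);
mlt.setFieldNames(new String[] {"content"});
Query query = mlt.like(someDocId); // placeholder doc number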
So this step just builds the term-frequency map of the query text. On to the next step, which builds the actual query terms from those frequencies:
/**
 * Create a PriorityQueue from a word->tf map.
 *
 * @param perFieldTermFrequencies a per field map of words keyed on the word(String) with Int objects as the values.
 */
private PriorityQueue<ScoreTerm> createQueue(Map<String, Map<String, Int>> perFieldTermFrequencies) throws IOException {
    // have collected all words in doc and their freqs
    int numDocs = ir.numDocs();
    final int limit = Math.min(maxQueryTerms, this.getTermsCount(perFieldTermFrequencies));
    FreqQ queue = new FreqQ(limit); // will order words by score
    for (Map.Entry<String, Map<String, Int>> entry : perFieldTermFrequencies.entrySet()) {
        Map<String, Int> perWordTermFrequencies = entry.getValue();
        String fieldName = entry.getKey();
        for (Map.Entry<String, Int> tfEntry : perWordTermFrequencies.entrySet()) { // for every word
            String word = tfEntry.getKey();
            int tf = tfEntry.getValue().x; // term freq in the source doc, i.e. how often this word occurs in the query text
            if (minTermFreq > 0 && tf < minTermFreq) { // below the configured minimum term frequency: skip the word
                continue; // filter out words that don't occur enough times in the source
            }
            int docFreq = ir.docFreq(new Term(fieldName, word)); // document frequency: how many index docs contain this word
            if (minDocFreq > 0 && docFreq < minDocFreq) { // below the configured minimum document frequency: skip the word
                continue; // filter out words that don't occur in enough docs
            }
            if (docFreq > maxDocFreq) {
                continue; // filter out words that occur in too many docs
            }
            if (docFreq == 0) {
                continue; // index update problem?
            }
            float idf = similarity.idf(docFreq, numDocs);
            float score = tf * idf;
            if (queue.size() < limit) {
                // there is still space in the queue
                queue.add(new ScoreTerm(word, fieldName, score, idf, docFreq, tf));
            } else {
                ScoreTerm term = queue.top();
                if (term.score < score) { // update the smallest in the queue in place and update the queue.
                    term.update(word, fieldName, score, idf, docFreq, tf);
                    queue.updateTop();
                }
            }
        }
    }
    return queue;
}
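The two filters at the top of that inner loop are what kill a hello-world setup, because MoreLikeThis ships with fairly strict defaults. They are public constants on the class, easy to check:

// defaults in org.apache.lucene.queries.mlt.MoreLikeThis
System.out.println(MoreLikeThis.DEFAULT_MIN_TERM_FREQ); // 2 -- a term must occur at least twice in the query text
System.out.println(MoreLikeThis.DEFAULT_MIN_DOC_FREQ);  // 5 -- a term must appear in at least 5 index documents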
Given the comments I added above and those defaults, it's clear where the problem is: I never set the minimum frequencies. Since this is a hello-world program, I had only indexed 5 records, all with different content, so every term has tf = 1 and docFreq = 1; the filters discard all of them and the resulting query is empty. The fix is to lower the thresholds in my own code:
mlt.setMinTermFreq(1);
mlt.setMinDocFreq(1);
Run the test again, and the results finally come back.
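For completeness, a few other setters on MoreLikeThis are worth knowing when the results come back too sparse or too noisy; these are all real methods on the class, but the values below are just illustrative:

mlt.setMaxQueryTerms(25);    // cap on how many terms end up in the generated query
mlt.setMinWordLen(2);        // drop tokens shorter than this
mlt.setMaxDocFreq(1000);     // drop terms that occur in too many documents
mlt.setBoost(true);          // boost each query term by its tf*idf score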