Lucene: MoreLikeThis query returns no results

    This post is written for Lucene beginners; veterans, please go easy on me.

    I just started learning Lucene, so I went straight to the newest version, 7.1.0. While still putting together a
hello-world project, my MoreLikeThis query returned no results no matter what I tried. Searching online turned up
nothing useful either, and everything I did find targeted older versions, so I had no choice but to read the source
code myself.

    If anyone has demos for the new version, please share them; figuring everything out alone is exhausting. Thanks in advance.

    Since I am on the latest version and couldn't find a demo, all of my code is taken from the sample snippets in the
API docs. For example, the MoreLikeThis Javadoc shows:

 IndexReader ir = ...
 IndexSearcher is = ...

 MoreLikeThis mlt = new MoreLikeThis(ir);
 Reader target = ... // orig source of doc you want to find similarities to
 Query query = mlt.like( target);
 
 Hits hits = is.search(query);
 // now the usual iteration thru 'hits' - the only thing to watch for is to make sure
 //you ignore the doc if it matches your 'target' document, as it should be similar to itself
    Note that this Javadoc snippet is itself outdated: the Hits class was removed from Lucene long ago, and in 7.x the search call returns TopDocs. Following the template as closely as I could, my test code looks like this:
    @Test
    public void testMoreLikeThis() throws IOException {
        final Path path = Paths.get(INDEX_DIR);
        Directory directory = FSDirectory.open(path);
        Analyzer analyzer = new StandardAnalyzer();

        IndexReader indexReader = DirectoryReader.open(directory);
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);
        MoreLikeThis mlt = new MoreLikeThis(indexReader);
        mlt.setAnalyzer(analyzer); // this must be set, otherwise the query throws an exception (see the source below)
        Query query = mlt.like("content",new StringReader("your doc content to search"));

        TopDocs topDocs = indexSearcher.search(query, 10);
        long count = topDocs.totalHits;
        System.out.println("Total hits: " + count);
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        for (ScoreDoc scoreDoc : scoreDocs) {
            Document document = indexSearcher.doc(scoreDoc.doc);
            System.out.print("相关度:"+scoreDoc.score);
            System.out.println(document.get("content"));
        }
    }
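
    For completeness, here is roughly how the hello-world index behind this test could be built. This is my own
reconstruction rather than code from the original project: INDEX_DIR, the "content" field name, and the sample texts
are assumptions.

    @Test
    public void testCreateIndex() throws IOException {
        Directory directory = FSDirectory.open(Paths.get(INDEX_DIR));
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(directory, config)) {
            // hypothetical sample data standing in for my 5 test records
            for (String text : new String[]{"your doc content to search", "some other unrelated content"}) {
                Document doc = new Document();
                // TextField is tokenized and indexed; Store.YES lets us print the content back later
                doc.add(new TextField("content", text, Field.Store.YES));
                writer.addDocument(doc);
            }
        }
    }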

    Running the query returns 0 hits, even though the content I searched for is definitely in the index. Time to look at the source:

  /**
   * Return a query that will return docs like the passed Readers.
   * This was added in order to treat multi-value fields.
   *
   * @return a query that will return docs like the passed Readers.
   */
  public Query like(String fieldName, Reader... readers) throws IOException {
    Map<String, Map<String, Int>> perFieldTermFrequencies = new HashMap<>();
    for (Reader r : readers) {
      addTermFrequencies(r, perFieldTermFrequencies, fieldName); // here the term frequencies of the query text are collected
    }
    return createQuery(createQueue(perFieldTermFrequencies));
  }
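
    The varargs signature is what handles multi-valued fields: each Reader's tokens are merged into the same per-field
frequency map. A small usage sketch (the field name and texts here are made up):

  // hypothetical example: both readers contribute term frequencies for "content"
  Query q = mlt.like("content",
      new StringReader("first value of the field"),
      new StringReader("second value of the field"));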

    The source that adds the term frequencies:

  /**
   * Adds term frequencies found by tokenizing text from reader into the Map words
   *
   * @param r a source of text to be tokenized
   * @param perFieldTermFrequencies a Map of terms and their frequencies per field
   * @param fieldName Used by analyzer for any special per-field analysis
   */
  private void addTermFrequencies(Reader r, Map<String, Map<String, Int>> perFieldTermFrequencies, String fieldName)
      throws IOException {
    if (analyzer == null) { // this is the analyzer requirement mentioned above; without it you get this exception, yet the API sample never sets one
      throw new UnsupportedOperationException("To use MoreLikeThis without " +
          "term vectors, you must provide an Analyzer");
    }
    Map<String, Int> termFreqMap = perFieldTermFrequencies.get(fieldName);
    if (termFreqMap == null) {
      termFreqMap = new HashMap<>();
      perFieldTermFrequencies.put(fieldName, termFreqMap);
    }
    try (TokenStream ts = analyzer.tokenStream(fieldName, r)) {
      int tokenCount = 0;
      // for every token
      CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        String word = termAtt.toString();
        tokenCount++;
        if (tokenCount > maxNumTokensParsed) {
          break;
        }
        if (isNoiseWord(word)) {
          continue;
        }

        // increment frequency
        Int cnt = termFreqMap.get(word);
        if (cnt == null) {
          termFreqMap.put(word, new Int());
        } else {
          cnt.x++;
        }
      }
      ts.end();
    }
  }
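
    Everything downstream depends on the tokens the analyzer emits here, so a useful sanity check when a query comes
back empty is to run the same text through the same analyzer and print the tokens. A small debugging sketch of my own,
not part of MoreLikeThis, assuming the analyzer and field name from the test above:

    try (TokenStream ts = analyzer.tokenStream("content", new StringReader("your doc content to search"))) {
        CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println("token: " + termAtt); // each surviving token becomes a query-term candidate
        }
        ts.end();
    }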

This part just builds the term-frequency map. On to the next step, where the query is created from those frequencies:

  /**
   * Create a PriorityQueue from a word->tf map.
   *
   * @param perFieldTermFrequencies a per field map of words keyed on the word(String) with Int objects as the values.
   */
  private PriorityQueue<ScoreTerm> createQueue(Map<String, Map<String, Int>> perFieldTermFrequencies) throws IOException {
    // have collected all words in doc and their freqs
    int numDocs = ir.numDocs();
    final int limit = Math.min(maxQueryTerms, this.getTermsCount(perFieldTermFrequencies));
    FreqQ queue = new FreqQ(limit); // will order words by score
    for (Map.Entry<String, Map<String, Int>> entry : perFieldTermFrequencies.entrySet()) {
      Map<String, Int> perWordTermFrequencies = entry.getValue();
      String fieldName = entry.getKey();

      for (Map.Entry<String, Int> tfEntry : perWordTermFrequencies.entrySet()) { // for every word
        String word = tfEntry.getKey();
        int tf = tfEntry.getValue().x; // term freq in the source doc, i.e. how many times this word occurs in it
        if (minTermFreq > 0 && tf < minTermFreq) { // skip words below the configured minimum term frequency
          continue; // filter out words that don't occur enough times in the source
        }

        int docFreq = ir.docFreq(new Term(fieldName, word)); // document frequency: in how many docs of the index this word appears

        if (minDocFreq > 0 && docFreq < minDocFreq) { // skip words below the configured minimum document frequency
          continue; // filter out words that don't occur in enough docs
        }

        if (docFreq > maxDocFreq) {
          continue; // filter out words that occur in too many docs
        }

        if (docFreq == 0) {
          continue; // index update problem?
        }

        float idf = similarity.idf(docFreq, numDocs);
        float score = tf * idf;

        if (queue.size() < limit) {
          // there is still space in the queue
          queue.add(new ScoreTerm(word, fieldName, score, idf, docFreq, tf));
        } else {
          ScoreTerm term = queue.top();
          if (term.score < score) { // update the smallest in the queue in place and update the queue.
            term.update(word, fieldName, score, idf, docFreq, tf);
            queue.updateTop();
          }
        }
      }
    }
    return queue;
  }
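
    MoreLikeThis also exposes retrieveInterestingTerms, which runs this same pipeline and returns the terms that
survive the filters, so you can check directly whether createQueue is throwing everything away. A sketch, assuming the
same mlt instance and field name as in my test:

    // if this prints an empty array, the minTermFreq/minDocFreq filters rejected every term
    String[] interesting = mlt.retrieveInterestingTerms(new StringReader("your doc content to search"), "content");
    System.out.println(Arrays.toString(interesting));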

    The comments I added above show exactly where the problem lies: I never tuned the minimum frequencies. MoreLikeThis defaults to minTermFreq = 2 and minDocFreq = 5, and since this is a hello-world program with very little data, just 5 records, all with different content, every term occurs only once in the source text and in only a handful of documents, so the minTermFreq/minDocFreq filters throw all of them away before scoring. So I changed my code to lower the minimums:

mlt.setMinTermFreq(1);
mlt.setMinDocFreq(1);

Running the test again, the results finally came back.
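
As a final check (my own addition, not from the original run): Query.toString() lists the terms that made it into the
generated boolean query, so printing it right after mlt.like(...) confirms the thresholds now let terms through.

    // prints something like: content:doc content:content content:search
    System.out.println(query);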

