Questioning how Lucene in Action and other books explain mergeFactor

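I'm about to start working on search, and my company uses Lucene, so I've been studying it on my own first. I read Lucene in Action and a book on Lucene 2.0 + Heritrix that I bought today. Both describe mergeFactor the same way: "every time mergeFactor documents are added to the index, a new segment is created on disk ...". Both gloss over minMergeDocs, saying only that it limits the number of documents held in memory.

That puzzled me: if both values did the same job, wouldn't they conflict? So I ran a few experiments. I have 81 documents, and I set mergeFactor to 5, minMergeDocs to 8, and maxMergeDocs to 45. According to the books, a segment should be created for every 5 documents. What actually happens?

[code]
package org.apache.lucene.demo;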

/**
 * Copyright 2004 The Apache Software Foundation
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Date;

class IndexFiles {
 public static void main(String[] args) throws IOException {
  String usage = "java " + IndexFiles.class + " <root_directory>";
  if (args.length == 0) {
   System.err.println("Usage: " + usage);
   System.exit(1);
  }

  Date start = new Date();
  try {
   // args[0] is the root directory of the documents to index;
   // the index itself is written to "index" (create = true overwrites it)
   File docDir = new File(args[0]);
   IndexWriter writer = new IndexWriter("index",
     new StandardAnalyzer(), true);
   writer.setUseCompoundFile(false);
   writer.mergeFactor = 5;
   writer.maxMergeDocs = 40;
   writer.minMergeDocs = 8;
   indexDocs(writer, docDir);

//   writer.optimize();
   writer.close();

   Date end = new Date();

   System.out.print(end.getTime() - start.getTime());
   System.out.println(" total milliseconds");

  } catch (IOException e) {
   System.out.println(" caught a " + e.getClass()
     + "\n with message: " + e.getMessage());
  }
 }

 public static void indexDocs(IndexWriter writer, File file)
   throws IOException {
  // do not try to index files that cannot be read
  if (file.canRead()) {
   if (file.isDirectory()) {
    String[] files = file.list();
    // an IO error could occur
    if (files != null) {
     for (int i = 0; i < files.length; i++) {
      indexDocs(writer, new File(file, files[i]));
     }
    }
   } else {

    try {
     if (file.getName().endsWith(".txt")) {
      System.out.println("adding " + file);
      writer.addDocument(FileDocument.Document(file));
     }
    }
    // at least on windows, some temporary files raise this
    // exception with an "access denied" message
    // checking if the file can be read doesn't help
    catch (FileNotFoundException fnfe) {
     ;
    }
   }
  }
 }
}
[/code]

I ran this in the debugger with a breakpoint on writer.addDocument(FileDocument.Document(file));
No segment was created when the 5th document was added; the first segment appeared when the 8th document was added. I then ran a second experiment with the two values swapped, and this time the first segment appeared when the 5th document was added.
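Instead of stepping through the debugger, the segment count can also be read off the index directory after each run. The sketch below assumes the non-compound layout used in the demo (setUseCompoundFile(false)), where each segment's files are named `_<segment>.<extension>`; that naming is an assumption about this Lucene version, and `SegmentCounter` and `segmentNames` are hypothetical helper names, not part of Lucene.

```java
import java.io.File;
import java.util.Set;
import java.util.TreeSet;

class SegmentCounter {
    /**
     * Returns the distinct segment names found in indexDir, inferred from
     * files named "_<segment>.<extension>" (assumed naming convention).
     */
    static Set<String> segmentNames(File indexDir) {
        Set<String> names = new TreeSet<>();
        String[] files = indexDir.list();
        if (files == null) {
            return names; // not a directory, or an I/O error occurred
        }
        for (String name : files) {
            int dot = name.indexOf('.');
            // Skip bookkeeping files such as "segments" that have no "_" prefix.
            if (name.startsWith("_") && dot > 1) {
                names.add(name.substring(0, dot));
            }
        }
        return names;
    }

    public static void main(String[] args) {
        File dir = new File(args.length > 0 ? args[0] : "index");
        Set<String> names = segmentNames(dir);
        System.out.println(names.size() + " segment(s): " + names);
    }
}
```

Files from segments that have already been merged away may linger on some platforms, so the result is best treated as an upper bound.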

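So my conclusion is that mergeFactor only controls how segments are merged; it does not control how many documents go into a new segment. It is minMergeDocs that controls how many documents produce a segment.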
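The behaviour seen in these experiments can be pictured with a small simulation. This is only a sketch under stated assumptions, not Lucene's actual code: `MergeModel` and `simulate` are hypothetical names, and the rules are a simplified reading of the results above. A segment is flushed once minMergeDocs documents are buffered; the most recent segments are then merged whenever they add up to the next level (minMergeDocs × mergeFactor, then × mergeFactor again, and so on), as long as that level does not exceed maxMergeDocs.

```java
import java.util.ArrayList;
import java.util.List;

class MergeModel {
    /** Returns the segment sizes (oldest first) left after adding docCount documents. */
    static List<Integer> simulate(int docCount, int mergeFactor,
                                  int minMergeDocs, int maxMergeDocs) {
        if (mergeFactor < 2 || minMergeDocs < 1) {
            throw new IllegalArgumentException("need mergeFactor >= 2 and minMergeDocs >= 1");
        }
        List<Integer> segments = new ArrayList<>();
        int buffered = 0; // documents still held in memory
        for (int doc = 0; doc < docCount; doc++) {
            buffered++;
            // Assumption: nothing is flushed while the buffer is below minMergeDocs,
            // or when even the first level would exceed maxMergeDocs.
            if (minMergeDocs > maxMergeDocs || buffered < minMergeDocs) {
                continue;
            }
            segments.add(buffered); // flush: a new segment appears
            buffered = 0;
            long target = (long) minMergeDocs * mergeFactor;
            while (target <= maxMergeDocs) {
                // Sum the trailing segments that are smaller than the target size.
                int first = segments.size();
                long sum = 0;
                while (first > 0 && segments.get(first - 1) < target) {
                    first--;
                    sum += segments.get(first);
                }
                if (sum < target) {
                    break; // not enough documents at this level yet
                }
                segments.subList(first, segments.size()).clear();
                segments.add((int) sum); // merge them into one larger segment
                target *= mergeFactor;
            }
        }
        if (buffered > 0) {
            segments.add(buffered); // closing the writer flushes the remainder
        }
        return segments;
    }

    public static void main(String[] args) {
        System.out.println(simulate(81, 5, 8, 40)); // prints [40, 40, 1]
    }
}
```

With the settings from the demo code (mergeFactor 5, minMergeDocs 8, maxMergeDocs 40), the model leaves segments of 40, 40 and 1 documents for 81 inputs, and swapping mergeFactor and minMergeDocs moves the first flush from the 8th document to the 5th. In this model, setting maxMergeDocs below minMergeDocs means nothing is flushed before close, so a single segment comes out; the sketch cannot confirm whether Lucene behaves that way for the same reason.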

 

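I'm also attaching an algorithm I wrote to compute the number of segments produced. I wrote it in a hurry, so it may contain mistakes. One branch is unverified: when maxMergeDocs < minMergeDocs, my experiment produced a single segment, and I don't know why.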

 

[code]
package com.sina.easy.util;

/** Estimates how many segments remain after indexing docNum documents. */
public class CountSegmentNum {
    private int docNum = 0;

    private int mergefactor = 10;

    private int maxMergeDocs = Integer.MAX_VALUE;

    private int minMergeDocs = 10;

    private int segmentNum = 0;

    public CountSegmentNum(int docNum, int mergefactor, int maxMergeDocs,
            int minMergeDocs) {
        this.docNum = docNum;
        this.mergefactor = mergefactor;
        this.maxMergeDocs = maxMergeDocs;
        this.minMergeDocs = minMergeDocs;
    }

    public void countNum() {
        int i = 1;
        int tempmerfactormulti = mergefactor;
        while (true) {
            if (docNum == 0) {
                return;
            }
            // Fewer documents than minMergeDocs: they end up in one last segment.
            if (docNum < minMergeDocs) {
                segmentNum++;
                return;
            }
            if (maxMergeDocs >= docNum) {
                int x = docNum / minMergeDocs;
                int z = x % mergefactor;
                if (x >= mergefactor) {
                    segmentNum++;
                }
                segmentNum += z;
                docNum = docNum % minMergeDocs;
            } else {
                if (maxMergeDocs < minMergeDocs) {
                    // This branch has not been verified in detail,
                    // but nobody should use such settings in practice.
                    segmentNum = 1;
                    return;
                }
                if (maxMergeDocs < tempmerfactormulti * minMergeDocs) {
                    int nowmerfactor = tempmerfactormulti;
                    for (; i >= 1; i--) {
                        nowmerfactor = tempmerfactormulti / mergefactor;
                        segmentNum += docNum / (nowmerfactor * minMergeDocs);
                        docNum = docNum % (nowmerfactor * minMergeDocs);
                    }
                } else {
                    tempmerfactormulti = tempmerfactormulti * mergefactor;
                    i++;
                }
            }
        }
    }

    public int getSegmentNum() {
        return segmentNum;
    }

    public static void main(String[] args) {
        // The constructor takes four arguments; the last three repeat the field defaults.
        CountSegmentNum csn = new CountSegmentNum(815604, 10, Integer.MAX_VALUE, 10);
        csn.countNum();
        System.out.println(csn.getSegmentNum());
    }
}
[/code]
