最近要做搜索了,而且公司用的就是lucene,所以自己先学习一番,看了lucene in action和今天买的一本lucene2.0+heritrix,上面对mergeFactor都是这样说的“每向索引添加mergeFactor个document时,就会有一个新的segment在磁盘建立起来......"。而对于minMergeDocs都是一笔带过,说是限制内存中文档的数量。
于是我就开始奇怪了,这两个值这么一来不就冲突了吗,两个值一样的功能,于是乎我就做了几个试验,我有81个document,然后我把mergeFactor设置为5,把minMergeDocs设置为8,把maxMergeDocs设置为45。按照书上的讲,这样每5个doc就会生成一个segment,事实怎么样呢[code]package org.apache.lucene.demo;
[code]
/**
* Copyright 2004 The Apache Software Foundation
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Date;
class IndexFiles {
public static void main(String[] args) throws IOException {
String usage = "java " + IndexFiles.class + " <root_directory>";
if (args.length == 0) {
System.err.println("Usage: " + usage);
System.exit(1);
}
Date start = new Date();
try {
File INDEX_DIR = new File(args[0]);
if (INDEX_DIR.exists()) {
INDEX_DIR.delete();
}
IndexWriter writer = new IndexWriter("index",
new StandardAnalyzer(), true);
writer.setUseCompoundFile(false);
writer.mergeFactor = 5;
writer.maxMergeDocs = 40;
writer.minMergeDocs = 8;
indexDocs(writer, INDEX_DIR);
// writer.optimize();
writer.close();
Date end = new Date();
System.out.print(end.getTime() - start.getTime());
System.out.println(" total milliseconds");
} catch (IOException e) {
System.out.println(" caught a " + e.getClass()
+ "\n with message: " + e.getMessage());
}
}
public static void indexDocs(IndexWriter writer, File file)
throws IOException {
// do not try to index files that cannot be read
if (file.canRead()) {
if (file.isDirectory()) {
String[] files = file.list();
// an IO error could occur
if (files != null) {
for (int i = 0; i < files.length; i++) {
indexDocs(writer, new File(file, files[i]));
}
}
} else {
try {
if (file.getName().endsWith(".txt")) {
System.out.println("adding " + file);
writer.addDocument(FileDocument.Document(file));
}
}
// at least on windows, some temporary files raise this
// exception with an "access denied" message
// checking if the file can be read doesn't help
catch (FileNotFoundException fnfe) {
;
}
}
}
}
}
[/code]
debug他在 writer.addDocument(FileDocument.Document(file)); writer.addDocument(FileDocument.Document(file));
这里设上断点,然后发现在第5个document添加的时候并没有出现segment生成,而是在第8个document添加的时候出现了第一个segment的生成。接下来再做一个试验把这两个值倒过来,然后你就会发现这次,在第5个document添加的时候出现了第一个segment的生成。
所以我认为,mergeFactor只是控制segment合并的,并不控制多少个document生成一个segement,而minMergeDocs是控制多少个document生成一个segement。
另外附上我自己写的一个计算产生segement数量的算法,写得比较匆忙,可能有不对的地方,另外有一条分支没有验证就是当maxMergeDocs<minMergeDocs时,我试验他就生成了一个segment不知道为啥。