For ease of analysis, here again is the source code of the Lucene indexing example from the previous chapter:
String filePath = ...; // file path
String indexPath = ...; // index path
File fileDir = new File(filePath);
Directory dir = FSDirectory.open(Paths.get(indexPath));
Analyzer luceneAnalyzer = new SimpleAnalyzer();
IndexWriterConfig iwc = new IndexWriterConfig(luceneAnalyzer);
iwc.setOpenMode(OpenMode.CREATE);
IndexWriter indexWriter = new IndexWriter(dir, iwc);
File[] textFiles = fileDir.listFiles();
for (int i = 0; i < textFiles.length; i++) {
  if (textFiles[i].isFile()) {
    String temp = FileReaderAll(textFiles[i].getCanonicalPath(), "GBK");
    Document document = new Document();
    Field FieldPath = new StringField("path", textFiles[i].getPath(), Field.Store.YES);
    Field FieldBody = new TextField("body", temp, Field.Store.YES);
    document.add(FieldPath);
    document.add(FieldBody);
    indexWriter.addDocument(document);
  }
}
indexWriter.close();
First, FSDirectory's open function opens the index directory, which will hold the index files generated later. The code is as follows:
public static FSDirectory open(Path path) throws IOException {
  return open(path, FSLockFactory.getDefault());
}
public static FSDirectory open(Path path, LockFactory lockFactory) throws IOException {
  if (Constants.JRE_IS_64BIT && MMapDirectory.UNMAP_SUPPORTED) {
    return new MMapDirectory(path, lockFactory);
  } else if (Constants.WINDOWS) {
    return new SimpleFSDirectory(path, lockFactory);
  } else {
    return new NIOFSDirectory(path, lockFactory);
  }
}
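The default LockFactory returned by FSLockFactory is NativeFSLockFactory, a factory that produces the file lock NativeFSLock; that code will be examined in detail if the analysis reaches it later. Assume here that FSDirectory's open function created an NIOFSDirectory. NIOFSDirectory extends FSDirectory and directly calls the constructor of its parent class FSDirectory: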
protected FSDirectory(Path path, LockFactory lockFactory) throws IOException {
  super(lockFactory);
  if (!Files.isDirectory(path)) {
    Files.createDirectories(path);
  }
  directory = path.toRealPath();
}
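The FSDirectory constructor creates the directory at the given Path if it does not already exist (Files.createDirectories also creates any missing parent directories) and stores the resolved real path. FSDirectory extends BaseDirectory, whose constructor simply stores the LockFactory, so there is no need to look further here.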
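The directory-preparation step of this constructor can be tried out with plain java.nio.file, without Lucene. The following is only a minimal sketch: the class name DirectoryInitSketch and the method prepare are illustrative names, not part of Lucene.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class DirectoryInitSketch {
  // Mirrors the two steps of the constructor above: create the directory
  // (including missing parents) if needed, then resolve it to a real path.
  static Path prepare(Path path) throws IOException {
    if (!Files.isDirectory(path)) {
      Files.createDirectories(path);
    }
    // toRealPath requires the path to exist, which is why it comes last.
    return path.toRealPath();
  }
}
```

Calling prepare twice on the same path is harmless: the second call skips the creation step and only resolves the path again.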
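Back in the example at the top, a SimpleAnalyzer is constructed next, and an IndexWriterConfig is then created from it. The IndexWriterConfig constructor directly calls the constructor of its parent class LiveIndexWriterConfig: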
LiveIndexWriterConfig(Analyzer analyzer) {
  this.analyzer = analyzer;
  ramBufferSizeMB = IndexWriterConfig.DEFAULT_RAM_BUFFER_SIZE_MB;
  maxBufferedDocs = IndexWriterConfig.DEFAULT_MAX_BUFFERED_DOCS;
  maxBufferedDeleteTerms = IndexWriterConfig.DEFAULT_MAX_BUFFERED_DELETE_TERMS;
  mergedSegmentWarmer = null;
  delPolicy = new KeepOnlyLastCommitDeletionPolicy();
  commit = null;
  useCompoundFile = IndexWriterConfig.DEFAULT_USE_COMPOUND_FILE_SYSTEM;
  openMode = OpenMode.CREATE_OR_APPEND;
  similarity = IndexSearcher.getDefaultSimilarity();
  mergeScheduler = new ConcurrentMergeScheduler();
  indexingChain = DocumentsWriterPerThread.defaultIndexingChain;
  codec = Codec.getDefault();
  infoStream = InfoStream.getDefault();
  mergePolicy = new TieredMergePolicy();
  flushPolicy = new FlushByRamOrCountsPolicy();
  readerPooling = IndexWriterConfig.DEFAULT_READER_POOLING;
  indexerThreadPool = new DocumentsWriterPerThreadPool();
  perThreadHardLimitMB = IndexWriterConfig.DEFAULT_RAM_PER_THREAD_HARD_LIMIT_MB;
}
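The LiveIndexWriterConfig constructor creates and stores a series of components. Each will be analyzed when the later code analysis runs into it, so we do not look further here.
Back in the Lucene example, an IndexWriter is created next from the IndexWriterConfig just built. IndexWriter is the core class for building indexes in Lucene. Its constructor is fairly long, so let's go through it step by step: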
public IndexWriter(Directory d, IndexWriterConfig conf) throws IOException {
  if (d instanceof FSDirectory && ((FSDirectory) d).checkPendingDeletions()) {
    throw new IllegalArgumentException();
  }
  conf.setIndexWriter(this);
  config = conf;
  infoStream = config.getInfoStream();
  writeLock = d.obtainLock(WRITE_LOCK_NAME);
  boolean success = false;
  try {
    directoryOrig = d;
    directory = new LockValidatingDirectoryWrapper(d, writeLock);
    mergeDirectory = addMergeRateLimiters(directory);
    analyzer = config.getAnalyzer();
    mergeScheduler = config.getMergeScheduler();
    mergeScheduler.setInfoStream(infoStream);
    codec = config.getCodec();
    bufferedUpdatesStream = new BufferedUpdatesStream(infoStream);
    poolReaders = config.getReaderPooling();
    OpenMode mode = config.getOpenMode();
    boolean create;
    if (mode == OpenMode.CREATE) {
      create = true;
    } else if (mode == OpenMode.APPEND) {
      create = false;
    } else {
      create = !DirectoryReader.indexExists(directory);
    }
    boolean initialIndexExists = true;
    String[] files = directory.listAll();
    IndexCommit commit = config.getIndexCommit();
    StandardDirectoryReader reader;
    if (commit == null) {
      reader = null;
    } else {
      reader = commit.getReader();
    }
    if (create) {
      if (config.getIndexCommit() != null) {
        if (mode == OpenMode.CREATE) {
          throw new IllegalArgumentException();
        } else {
          throw new IllegalArgumentException();
        }
      }
      SegmentInfos sis = null;
      try {
        sis = SegmentInfos.readLatestCommit(directory);
        sis.clear();
      } catch (IOException e) {
        initialIndexExists = false;
        sis = new SegmentInfos();
      }
      segmentInfos = sis;
      rollbackSegments = segmentInfos.createBackupSegmentInfos();
      changed();
    } else if (reader != null) {
      ...
    } else {
      ...
    }
    pendingNumDocs.set(segmentInfos.totalMaxDoc());
    globalFieldNumberMap = getFieldNumberMap();
    config.getFlushPolicy().init(config);
    docWriter = new DocumentsWriter(this, config, directoryOrig, directory);
    eventQueue = docWriter.eventQueue();
    synchronized(this) {
      deleter = new IndexFileDeleter(files, directoryOrig, directory,
                                     config.getIndexDeletionPolicy(),
                                     segmentInfos, infoStream, this,
                                     initialIndexExists, reader != null);
      assert create || filesExist(segmentInfos);
    }
    if (deleter.startingCommitDeleted) {
      changed();
    }
    if (reader != null) {
      ...
    }
    success = true;
  } finally {
    if (!success) {
      IOUtils.closeWhileHandlingException(writeLock);
      writeLock = null;
    }
  }
}
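The IndexWriter constructor first calls checkPendingDeletions to delete files that were marked for deletion; if some of them still cannot be deleted, it throws an IllegalArgumentException. checkPendingDeletions is defined in FSDirectory, as shown below: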
public boolean checkPendingDeletions() throws IOException {
  deletePendingFiles();
  return pendingDeletes.isEmpty() == false;
}
public synchronized void deletePendingFiles() throws IOException {
  if (pendingDeletes.isEmpty() == false) {
    for(String name : new HashSet<>(pendingDeletes)) {
      privateDeleteFile(name, true);
    }
  }
}
private void privateDeleteFile(String name, boolean isPendingDelete) throws IOException {
  try {
    Files.delete(directory.resolve(name));
    pendingDeletes.remove(name);
  } catch (NoSuchFileException | FileNotFoundException e) {
  } catch (IOException ioe) {
  }
}
checkPendingDeletions ends up calling Files.delete on each file recorded in pendingDeletes, and returns true if any of them are still pending afterwards.
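The retry-and-report pattern described above can be sketched with the JDK alone. This is a simplified illustration, not the FSDirectory implementation: PendingDeletesSketch and markPending are hypothetical names, and treating an already-missing file as successfully deleted is an assumption of this sketch.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

final class PendingDeletesSketch {
  private final Path directory;
  private final Set<String> pendingDeletes = new HashSet<>();

  PendingDeletesSketch(Path directory) {
    this.directory = directory;
  }

  // Hypothetical helper: remember a file whose earlier deletion failed.
  synchronized void markPending(String name) {
    pendingDeletes.add(name);
  }

  // Retries every pending deletion and reports whether any are still pending.
  synchronized boolean checkPendingDeletions() {
    // Iterate over a copy because successful deletions shrink the set.
    for (String name : new HashSet<>(pendingDeletes)) {
      try {
        Files.delete(directory.resolve(name));
        pendingDeletes.remove(name);
      } catch (NoSuchFileException e) {
        // Assumption of this sketch: a file that is already gone counts as deleted.
        pendingDeletes.remove(name);
      } catch (IOException e) {
        // Still not deletable (for example, held open elsewhere); keep it pending.
      }
    }
    return !pendingDeletes.isEmpty();
  }
}
```

A caller in the role of the IndexWriter constructor would refuse to continue when checkPendingDeletions returns true.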
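Back in the IndexWriter constructor, getInfoStream returns the NoOutput instance that was set in the LiveIndexWriterConfig constructor; this infoStream is used to emit diagnostic messages. The constructor then calls the directory's obtainLock function to acquire the write lock on the index. We do not analyze this further here.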
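Although the lock code is not analyzed here, the general idea of a native file lock can be sketched with java.nio. The sketch below is an illustration under stated assumptions, not Lucene's NativeFSLock: WriteLockSketch and tryObtain are hypothetical names, and the lock file name "write.lock" is assumed to match WRITE_LOCK_NAME.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class WriteLockSketch implements AutoCloseable {
  private final FileChannel channel;
  private final FileLock lock;

  private WriteLockSketch(FileChannel channel, FileLock lock) {
    this.channel = channel;
    this.lock = lock;
  }

  // Returns a held lock, or null when the lock is already taken.
  static WriteLockSketch tryObtain(Path indexDir) throws IOException {
    Path lockFile = indexDir.resolve("write.lock"); // assumed lock file name
    FileChannel channel = FileChannel.open(lockFile,
        StandardOpenOption.CREATE, StandardOpenOption.WRITE);
    try {
      FileLock lock = channel.tryLock();
      if (lock != null) {
        return new WriteLockSketch(channel, lock);
      }
    } catch (OverlappingFileLockException e) {
      // Another channel in this JVM already holds the lock; fall through.
    } catch (IOException e) {
      channel.close();
      throw e;
    }
    channel.close();
    return null;
  }

  @Override
  public void close() throws IOException {
    lock.release();
    channel.close();
  }
}
```

One caveat from the FileChannel documentation: on some systems, closing any channel on a file releases all locks the JVM holds on that file, so a production implementation needs extra bookkeeping that this sketch omits.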
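Back in the IndexWriter constructor, a series of creations and assignments follows. Assume create is true, meaning the index is being created for the first time or recreated. The constructor then reads the segment information through SegmentInfos' readLatestCommit function: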
public static final SegmentInfos readLatestCommit(Directory directory) throws IOException {
  return new FindSegmentsFile(directory) {
    @Override
    protected SegmentInfos doBody(String segmentFileName) throws IOException {
      return readCommit(directory, segmentFileName);
    }
  }.run();
}
SegmentInfos' readLatestCommit function creates a FindSegmentsFile and calls its run function, defined as follows:
public T run() throws IOException {
  return run(null);
}
public T run(IndexCommit commit) throws IOException {
  long lastGen = -1;
  long gen = -1;
  IOException exc = null;
  for (;;) {
    lastGen = gen;
    String files[] = directory.listAll();
    String files2[] = directory.listAll();
    Arrays.sort(files);
    Arrays.sort(files2);
    if (!Arrays.equals(files, files2)) {
      // The listing changed between the two calls; retry.
      continue;
    }
    gen = getLastCommitGeneration(files);
    if (gen == -1) {
      throw new IndexNotFoundException();
    } else if (gen > lastGen) {
      String segmentFileName = IndexFileNames.fileNameFromGeneration(IndexFileNames.SEGMENTS, "", gen);
      try {
        T t = doBody(segmentFileName);
        return t;
      } catch (IOException err) {
        // Remember the first failure so that it can be rethrown below.
        if (exc == null) {
          exc = err;
        }
      }
    } else {
      throw exc;
    }
  }
}
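Here the generic type T is SegmentInfos. The run function first calls getLastCommitGeneration to obtain the generation. For example, if the index directory contains a file named segments_6, getLastCommitGeneration returns 6, which is assigned to gen. If no segments file is found, gen is -1 and an IndexNotFoundException is thrown. If gen is greater than lastGen, the segment information has been updated, so doBody is called to read the segments_6 file and return a SegmentInfos. Otherwise, when gen has not advanced past lastGen, run gives up and throws exc.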
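The mapping between segments file names and generations can be illustrated with a small stand-alone sketch. It is not the Lucene code: SegmentsGenerationSketch is a hypothetical class, only names of the form segments_N are handled, and the suffix is assumed to be the generation written in base 36 (Character.MAX_RADIX), which coincides with decimal for segments_6.

```java
final class SegmentsGenerationSketch {
  static final String SEGMENTS = "segments";

  // Returns the highest generation among names like "segments_6",
  // or -1 when no such file is present.
  static long lastCommitGeneration(String[] files) {
    long max = -1;
    for (String file : files) {
      if (file.startsWith(SEGMENTS + "_")) {
        // Assumption: the suffix is the generation in base 36.
        long gen = Long.parseLong(file.substring(SEGMENTS.length() + 1), Character.MAX_RADIX);
        max = Math.max(max, gen);
      }
    }
    return max;
  }

  // Inverse direction: builds the segments file name for a generation.
  static String fileNameFromGeneration(long gen) {
    return SEGMENTS + "_" + Long.toString(gen, Character.MAX_RADIX);
  }
}
```

Under these assumptions, a directory listing of _0.cfs, segments_6 and write.lock yields generation 6, and a listing that also contains segments_a yields 10.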
As the readLatestCommit code above shows, doBody in turn calls the readCommit function, which is defined in SegmentInfos:
public static final SegmentInfos readCommit(Directory directory, String segmentFileName) throws IOException {
  long generation = generationFromSegmentsFileName(segmentFileName);
  try (ChecksumIndexInput input = directory.openChecksumInput(segmentFileName, IOContext.READ)) {
    return readCommit(directory, input, generation);
  }
}
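The readCommit function first opens a ChecksumIndexInput and then calls the overloaded readCommit to read the segment information and return a SegmentInfos. That overload depends on the concrete format and protocol of the segments_* file, so we do not look further here. The SegmentInfos that is finally returned holds the segment information.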
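The idea behind a checksumming input, computing a checksum over the bytes as they are read so that it can be compared with a stored value afterwards, can be shown with the JDK's CheckedInputStream. This is only an analogy under assumptions: ChecksumInputSketch is a hypothetical class, and the use of CRC32 here is an assumption of the sketch rather than something stated in the code above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.CRC32;
import java.util.zip.CheckedInputStream;

final class ChecksumInputSketch {
  // Reads the whole stream and returns the CRC32 of the bytes that passed through.
  static long checksumOf(InputStream raw) throws IOException {
    try (CheckedInputStream in = new CheckedInputStream(raw, new CRC32())) {
      byte[] buffer = new byte[4096];
      while (in.read(buffer) != -1) {
        // A real reader would decode the segment data here.
      }
      return in.getChecksum().getValue();
    }
  }
}
```

A reader built this way can detect a corrupted file by comparing the returned value with a checksum recorded when the file was written.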
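Back in the IndexWriter constructor: if readLatestCommit succeeds, the returned SegmentInfos is emptied by calling clear. If it throws an IOException, as happens when the index is being created for the first time and no segments file exists, initialIndexExists is set to false and a new SegmentInfos is constructed; the SegmentInfos constructor is empty. Next, SegmentInfos' createBackupSegmentInfos function backs up the list of SegmentCommitInfo entries, mainly for use by the rollback operation. IndexWriter then calls changed to record that the segment information has changed.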
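The control flow of this create branch can be modeled without Lucene. In the sketch below, a List<String> stands in for SegmentInfos, and CreateBranchSketch and CommitReader are hypothetical names introduced only for illustration.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

final class CreateBranchSketch {
  // Hypothetical stand-in for SegmentInfos.readLatestCommit.
  interface CommitReader {
    List<String> readLatestCommit() throws IOException;
  }

  final List<String> segments;         // stand-in for segmentInfos
  final List<String> rollbackSegments; // stand-in for the backup used by rollback
  final boolean initialIndexExists;

  CreateBranchSketch(CommitReader reader) {
    List<String> sis;
    boolean exists = true;
    try {
      // An index already exists: read it, then discard its segments.
      sis = reader.readLatestCommit();
      sis.clear();
    } catch (IOException e) {
      // No readable commit: start from an empty segment list.
      exists = false;
      sis = new ArrayList<>();
    }
    segments = sis;
    rollbackSegments = new ArrayList<>(sis);
    initialIndexExists = exists;
  }
}
```

Either way the writer starts with an empty segment list; the only difference the branch records is whether an index existed beforehand.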
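Continuing down the IndexWriter constructor, pendingNumDocs records the total number of documents in the index, and globalFieldNumberMap records Field-related information for the index. getFlushPolicy returns the FlushByRamOrCountsPolicy created in the LiveIndexWriterConfig constructor, and its init function performs some simple assignments. Further down, a DocumentsWriter is created and its event queue is stored in eventQueue. The constructor then creates an IndexFileDeleter, which manages the index files, for example by reference counting them, so that operations on index files stay consistent in a multithreaded environment.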
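Reference counting of files, as mentioned for IndexFileDeleter, can be sketched as follows. This is a generic illustration of the technique, not the IndexFileDeleter implementation: RefCountSketch and its methods are hypothetical, and "deleting" a file is modeled by recording its name in a set.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class RefCountSketch {
  private final Map<String, Integer> refCounts = new HashMap<>();
  private final Set<String> deleted = new HashSet<>();

  // Registers one more user of the file.
  synchronized void incRef(String file) {
    refCounts.merge(file, 1, Integer::sum);
  }

  // Drops one user; the file is "deleted" once nobody references it.
  synchronized void decRef(String file) {
    Integer count = refCounts.get(file);
    if (count == null) {
      throw new IllegalStateException("file is not referenced: " + file);
    }
    if (count == 1) {
      refCounts.remove(file);
      deleted.add(file); // a real deleter would remove the file from disk here
    } else {
      refCounts.put(file, count - 1);
    }
  }

  synchronized boolean isDeleted(String file) {
    return deleted.contains(file);
  }
}
```

The synchronized methods keep the counts consistent when several threads share the same instance.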
The next chapter continues the analysis of the source code of the Lucene indexing example.