Let's start with some sample code for creating an index:
Directory directory = FSDirectory.getDirectory("/tmp/testindex");
// Use standard analyzer
Analyzer analyzer = new StandardAnalyzer();
// Create IndexWriter object
IndexWriter iwriter = new IndexWriter(directory, analyzer, true);
iwriter.setMaxFieldLength(25000);
// Make a new, empty document
Document doc = new Document();
File f = new File("/tmp/test.txt");
// Add the path of the file as a field named "path". Use a field that is
// indexed (i.e. searchable), but don't tokenize the field into words.
doc.add(new Field("path", f.getPath(), Field.Store.YES, Field.Index.UN_TOKENIZED));
String text = "This is the text to be indexed.";
doc.add(new Field("fieldname", text, Field.Store.YES, Field.Index.TOKENIZED));
// Add the last modified date of the file as a field named "modified". Use
// a field that is indexed (i.e. searchable), but don't tokenize the field
// into words.
doc.add(new Field("modified", DateTools.timeToString(f.lastModified(), DateTools.Resolution.MINUTE), Field.Store.YES, Field.Index.UN_TOKENIZED));
// Add the contents of the file to a field named "contents". Specify a Reader,
// so that the text of the file is tokenized and indexed, but not stored.
// Note that FileReader expects the file to be in the system's default encoding.
// If that's not the case searching for special characters will fail.
doc.add(new Field("contents", new FileReader(f)));
iwriter.addDocument(doc);
iwriter.optimize();
iwriter.close();
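To make the difference between the TOKENIZED and UN_TOKENIZED fields in the sample concrete, here is a minimal sketch in plain Java. It is not Lucene code: the class FieldTermsSketch and both methods are hypothetical, and the splitting rule is only a crude stand-in for an analyzer. A real analyzer such as StandardAnalyzer does more work, for example stop-word handling.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; none of this is Lucene API.
class FieldTermsSketch {

    // Un-tokenized: the whole field value becomes a single term.
    static List<String> untokenizedTerms(String value) {
        return List.of(value);
    }

    // Tokenized: split on non-alphanumeric characters and lowercase each token.
    // This is an assumed, simplified stand-in for what an analyzer does.
    static List<String> tokenizedTerms(String value) {
        List<String> terms = new ArrayList<>();
        for (String token : value.split("[^\\p{Alnum}]+")) {
            if (!token.isEmpty()) {
                terms.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // The "path" field: searchable only by its exact value.
        System.out.println(untokenizedTerms("/tmp/test.txt"));
        // The "fieldname" field: searchable by each individual word.
        System.out.println(tokenizedTerms("This is the text to be indexed."));
    }
}
```

Under these assumptions the path stays one term, which is why an un-tokenized field suits identifiers such as paths and dates, while the free text is broken into separately searchable words.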
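As the sample code shows, index creation happens mainly inside IndexWriter. The call relationships of IndexWriter are shown in the figure below:
The end result is a set of index files.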
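The .fdx file is the field index file and .fdt is the field data file. The .nrm file holds the norms, the normalization factors used when computing document scores. The .tvf file is one of the term vector files; it stores the term list, the term frequencies, and optional position and offset information. The .tvx file stores offsets into the .tvd document file and the .tvf field file. The .tvd file is the term vector document file: it contains the number of fields, the list of fields that have term vectors, and a list of pointers to the field information in the .tvf file, so it maps out which fields store term vectors and where their information sits in the .tvf file. The .prx file is the position data file; it holds the list of positions at which each term occurs in every document. The .tii and .tis files are the term info index file and the term info data file, respectively.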
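As a quick reference, the extensions described above can be collected into a lookup table. The following is a small sketch in plain Java; the class IndexFileKinds and the sample file names are hypothetical, and the table covers only the extensions mentioned in this section.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; it only restates the extensions described in the text.
class IndexFileKinds {

    static final Map<String, String> KINDS = new LinkedHashMap<>();
    static {
        KINDS.put("fdx", "field index");
        KINDS.put("fdt", "field data");
        KINDS.put("nrm", "norms (scoring normalization factors)");
        KINDS.put("tvx", "term vector index (offsets into .tvd and .tvf)");
        KINDS.put("tvd", "term vector documents");
        KINDS.put("tvf", "term vector fields");
        KINDS.put("prx", "term positions");
        KINDS.put("tii", "term info index");
        KINDS.put("tis", "term info data");
    }

    // Returns a short description for a file name, based on its extension.
    static String describe(String fileName) {
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1);
        return KINDS.getOrDefault(ext, "unknown");
    }

    public static void main(String[] args) {
        // Sample names are made up for illustration.
        for (String name : new String[] {"_0.fdt", "_0.tis", "_0.prx", "notes.txt"}) {
            System.out.println(name + " -> " + describe(name));
        }
    }
}
```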
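Now that we know the call relationships of IndexWriter, what does its source actually look like? Let's analyze the source code involved in index creation. IndexWriter's addDocument function ultimately calls the updateDocument function of DocumentsWriter, so here is updateDocument first: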
boolean updateDocument(Iterable<? extends IndexableField> doc, Analyzer analyzer, Term delTerm) throws IOException, AbortingException {
    // Pre-processing; this function is explained below
    boolean hasEvents = this.preUpdate();
    // Obtain a ThreadState for the calling thread and lock it
    ThreadState perThread = this.flushControl.obtainAndLock();
    DocumentsWriterPerThread flushingDWPT;
    try {
        // Make sure the writer is still open
        this.ensureOpen();
        this.ensureInitialized(perThread);
        assert perThread.isInitialized();
        // Get the DocumentsWriterPerThread (DWPT) bound to this ThreadState
        DocumentsWriterPerThread dwpt = perThread.dwpt;
        int dwptNumDocs = dwpt.getNumDocsInRAM();
        try {
            dwpt.updateDocument(doc, analyzer, delTerm);
        } catch (AbortingException e) {
            this.flushControl.doOnAbort(perThread);
            dwpt.abort();
            throw e;
        } finally {
            // Add the change in this DWPT's buffered document count to the global counter
            this.numDocsInRAM.addAndGet(dwpt.getNumDocsInRAM() - dwptNumDocs);
        }
        boolean isUpdate = delTerm != null;
        // Post-processing: let flush control decide whether a DWPT must now be flushed
        flushingDWPT = this.flushControl.doAfterDocument(perThread, isUpdate);
    } finally {
        // Release the ThreadState back to the pool
        this.perThreadPool.release(perThread);
    }
    // Post-update: flush if necessary
    return this.postUpdate(flushingDWPT, hasEvents);
}
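The shape of this function (lock a per-thread buffer, update it, publish the change in the buffered count from a finally block, then release) can be modeled without Lucene. The following is a minimal sketch under stated assumptions: UpdateFlowSketch, Buffer, and their members are hypothetical names, a plain ReentrantLock stands in for the ThreadState, and flushing is left out entirely.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical model of the control flow; this is not Lucene code.
class UpdateFlowSketch {

    // Stand-in for a DocumentsWriterPerThread: buffers documents in RAM.
    static class Buffer {
        final ReentrantLock lock = new ReentrantLock();
        int docsInRam;

        void add(String doc) {
            if (doc == null) {
                throw new IllegalArgumentException("null document");
            }
            docsInRam++;
        }
    }

    final Buffer buffer = new Buffer();
    // Global count of buffered documents, shared by all indexing threads.
    final AtomicInteger numDocsInRam = new AtomicInteger();

    void updateDocument(String doc) {
        buffer.lock.lock(); // plays the role of obtainAndLock()
        try {
            int before = buffer.docsInRam;
            try {
                buffer.add(doc);
            } finally {
                // Publish the delta whether or not add() threw.
                numDocsInRam.addAndGet(buffer.docsInRam - before);
            }
        } finally {
            buffer.lock.unlock(); // plays the role of perThreadPool.release()
        }
    }

    public static void main(String[] args) {
        UpdateFlowSketch writer = new UpdateFlowSketch();
        writer.updateDocument("doc a");
        writer.updateDocument("doc b");
        try {
            writer.updateDocument(null);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
        System.out.println("buffered docs: " + writer.numDocsInRam.get());
    }
}
```

Computing the delta in a finally block keeps the global counter consistent with the per-thread buffer even when the update fails partway, which appears to be the reason the original code snapshots dwptNumDocs before calling dwpt.updateDocument.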
Next, let's look at the pre-update and post-update processing:
private boolean preUpdate() throws IOException, AbortingException {
    this.ensureOpen();
    boolean hasEvents = false;
    // If any threads are stalled or the flush queue is not empty
    if(this.flushControl.anyStalledThreads() || this.flushControl.numQueuedFlushes() > 0) {
        // Log only if diagnostic output is enabled for the "DW" component
        if(this.infoStream.isEnabled("DW")) {
            this.infoStream.message("DW", "DocumentsWriter has queued dwpt; will hijack this thread to flush pending segment(s)");
        }
        // The calling indexing thread helps out: keep flushing pending DWPTs until the queue is empty
        while(true) {
            DocumentsWriterPerThread flushingDWPT;
            // Nothing to flush right now: wait while stalled, and return once the queue is drained
            while((flushingDWPT = this.flushControl.nextPendingFlush()) == null) {
                if(this.infoStream.isEnabled("DW") && this.flushControl.anyStalledThreads()) {
                    this.infoStream.message("DW", "WARNING DocumentsWriter has stalled threads; waiting");
                }
                this.flushControl.waitIfStalled();
                if(this.flushControl.numQueuedFlushes() == 0) {
                    if(this.infoStream.isEnabled("DW")) {
                        this.infoStream.message("DW", "continue indexing after helping out flushing DocumentsWriter is healthy");
                    }
                    return hasEvents;
                }
            }
            hasEvents |= this.doFlush(flushingDWPT);
        }
    } else {
        return hasEvents;
    }
}
private boolean postUpdate(DocumentsWriterPerThread flushingDWPT, boolean hasEvents) throws IOException, AbortingException {
    // Apply any pending deletes first
    hasEvents |= this.applyAllDeletes(this.deleteQueue);
    if(flushingDWPT != null) {
        // This update triggered a flush: write that DWPT's segment to disk
        hasEvents |= this.doFlush(flushingDWPT);
    } else {
        // Otherwise, help out by flushing the next pending DWPT, if there is one
        DocumentsWriterPerThread nextPendingFlush = this.flushControl.nextPendingFlush();
        if(nextPendingFlush != null) {
            hasEvents |= this.doFlush(nextPendingFlush);
        }
    }
    return hasEvents;
}
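The idea behind preUpdate and postUpdate, where an indexing thread is borrowed to flush pending segments before and after it adds its own document, can be sketched in plain Java. The following is a single-threaded simplification under stated assumptions: HelpFlushSketch and its members are hypothetical, segments are represented by strings, and stall handling and delete application are omitted.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical, single-threaded model of the "help out with flushing" pattern.
class HelpFlushSketch {

    final Queue<String> pendingFlushes = new ArrayDeque<>();
    final List<String> flushed = new ArrayList<>();

    // Like preUpdate(): drain the flush queue before indexing and
    // report whether any flush work was done.
    boolean preUpdate() {
        boolean hasEvents = false;
        String segment;
        while ((segment = pendingFlushes.poll()) != null) {
            hasEvents |= doFlush(segment);
        }
        return hasEvents;
    }

    // Like postUpdate(): flush the buffer this update filled up, or
    // otherwise help with at most one pending flush.
    boolean postUpdate(String justFilled, boolean hasEvents) {
        if (justFilled != null) {
            hasEvents |= doFlush(justFilled);
        } else {
            String next = pendingFlushes.poll();
            if (next != null) {
                hasEvents |= doFlush(next);
            }
        }
        return hasEvents;
    }

    // Stand-in for writing a segment to the directory.
    boolean doFlush(String segment) {
        flushed.add(segment);
        return true;
    }

    public static void main(String[] args) {
        HelpFlushSketch dw = new HelpFlushSketch();
        dw.pendingFlushes.add("_0");
        dw.pendingFlushes.add("_1");
        boolean hasEvents = dw.preUpdate();
        hasEvents = dw.postUpdate("_2", hasEvents);
        System.out.println("flushed " + dw.flushed + ", hasEvents=" + hasEvents);
    }
}
```

The design choice this illustrates is back-pressure: a thread that wants to add documents first pays down the flush backlog, so buffered data cannot grow without bound while flushing falls behind.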
// DocumentsWriterPerThread.updateDocument, reached through dwpt.updateDocument(...) above
public void updateDocument(Iterable<? extends IndexableField> doc, Analyzer analyzer, Term delTerm) throws IOException, AbortingException {
    this.testPoint("DocumentsWriterPerThread addDocument start");
    assert this.deleteQueue != null;
    this.reserveOneDoc();
    // Record the document, its analyzer, and its in-RAM doc ID in the per-thread state
    this.docState.doc = doc;
    this.docState.analyzer = analyzer;
    this.docState.docID = this.numDocsInRAM;
    boolean success = false;
    try {
        try {
            // Hand the document to the indexing chain
            this.consumer.processDocument();
        } finally {
            this.docState.clear();
        }
        success = true;
    } finally {
        if(!success) {
            // Processing failed: mark the document as deleted; it still consumes a doc ID
            this.deleteDocID(this.docState.docID);
            ++this.numDocsInRAM;
        }
    }
    this.finishDocument(delTerm);
}
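DocumentsWriter gives each indexing thread its own DocumentsWriterPerThread to process the documents held in memory, analyzing the Fields of each doc one by one to build the corresponding index data. Once these are flushed, the index files are stored on disk. For the details of how the consumer uses the analyzer to split the fields of a Document into terms and index them, see the previous chapter.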
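The success flag in the per-thread updateDocument deserves a closer look: when processing fails, the document is marked as deleted but its doc ID is still used up. The following plain-Java sketch models only that bookkeeping. DocIdSketch and its members are hypothetical, and the sketch assumes that the success path advances the counter inside its stand-in for finishDocument.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of doc ID bookkeeping on failure; this is not Lucene code.
class DocIdSketch {

    int numDocsInRam;
    final Set<Integer> deletedDocIds = new HashSet<>();

    void updateDocument(String doc) {
        int docId = numDocsInRam;
        boolean success = false;
        try {
            process(doc);
            success = true;
        } finally {
            if (!success) {
                // The failed document keeps its ID but is marked as deleted.
                deletedDocIds.add(docId);
                numDocsInRam++;
            }
        }
        // Assumed stand-in for finishDocument(): only reached on success.
        numDocsInRam++;
    }

    // Stand-in for the indexing chain; rejects empty documents.
    void process(String doc) {
        if (doc.isEmpty()) {
            throw new IllegalStateException("cannot index an empty document");
        }
    }

    public static void main(String[] args) {
        DocIdSketch dwpt = new DocIdSketch();
        dwpt.updateDocument("first");
        try {
            dwpt.updateDocument("");
        } catch (IllegalStateException e) {
            System.out.println("failed: " + e.getMessage());
        }
        dwpt.updateDocument("third");
        System.out.println("docs in RAM: " + dwpt.numDocsInRam
                + ", deleted IDs: " + dwpt.deletedDocIds);
    }
}
```

Keeping the failed document's ID and marking it deleted means the IDs of later documents never shift, so any data already written for the partial document stays addressable and is simply skipped.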