For convenience of analysis, here again is the Lucene index-creation example given in Chapter 1:
String filePath = ...  // path of the directory holding the files to index
String indexPath = ...  // path of the index directory
File fileDir = new File(filePath);
Directory dir = FSDirectory.open(Paths.get(indexPath));
Analyzer luceneAnalyzer = new SimpleAnalyzer();
IndexWriterConfig iwc = new IndexWriterConfig(luceneAnalyzer);
iwc.setOpenMode(OpenMode.CREATE);
IndexWriter indexWriter = new IndexWriter(dir, iwc);
File[] textFiles = fileDir.listFiles();
for (int i = 0; i < textFiles.length; i++) {
    if (textFiles[i].isFile()) {
        String temp = FileReaderAll(textFiles[i].getCanonicalPath(), "GBK");
        Document document = new Document();
        Field fieldPath = new StringField("path", textFiles[i].getPath(), Field.Store.YES);
        Field fieldBody = new TextField("body", temp, Field.Store.YES);
        document.add(fieldPath);
        document.add(fieldBody);
        indexWriter.addDocument(document);
    }
}
indexWriter.close();
In this code, the FileReaderAll function reads a file's contents into a string, with "GBK" as the default encoding. After the all-important IndexWriter has been created, the code iterates over the files to be indexed, constructs the corresponding Document and Field objects, and finally starts indexing through IndexWriter's addDocument function.
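FileReaderAll is a helper belonging to the example rather than to Lucene; a minimal sketch consistent with the call site above (its exact signature is an assumption) might be:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

// Hypothetical helper matching the call above: read the whole file into a
// String using the given character encoding (e.g. "GBK").
public static String FileReaderAll(String fileName, String charset) throws IOException {
    StringBuilder sb = new StringBuilder();
    try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(new FileInputStream(fileName), charset))) {
        String line;
        while ((line = reader.readLine()) != null) {
            sb.append(line).append('\n');
        }
    }
    return sb.toString();
}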
Document's constructor is empty, and the constructors of StringField, TextField and Field are also simple. The focus below is on IndexWriter's addDocument function, which not only adds the document's data but also builds the index. The code is as follows:
IndexWriter::addDocument
public void addDocument(Iterable<? extends IndexableField> doc) throws IOException {
updateDocument(null, doc);
}
public void updateDocuments(Term delTerm, Iterable<? extends Iterable<? extends IndexableField>> docs) throws IOException {
ensureOpen();
try {
boolean success = false;
try {
if (docWriter.updateDocuments(docs, analyzer, delTerm)) {
processEvents(true, false);
}
success = true;
} finally {
}
} catch (AbortingException | VirtualMachineError tragedy) {
}
}
addDocument completes the addition by delegating to updateDocument; its body is essentially identical to the batch variant updateDocuments quoted above, which is the path traced here. The delTerm parameter is used by update and delete operations; it is null here and can be ignored for now. updateDocuments first makes sure via ensureOpen that the IndexWriter has not been closed, and then calls DocumentsWriter's updateDocuments function:
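As an aside, delTerm is what makes in-place updates work: when it is non-null, documents matching that term are deleted atomically as the new document is added. A hedged usage sketch against the example's "path" field:

// Sketch: re-indexing a changed file. updateDocument deletes any document
// whose "path" term matches, then adds the new document in one atomic step.
indexWriter.updateDocument(new Term("path", textFiles[i].getPath()), document);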
IndexWriter::updateDocuments->DocumentsWriter::updateDocuments
boolean updateDocuments(final Iterable<? extends Iterable<? extends IndexableField>> docs, final Analyzer analyzer, final Term delTerm) throws IOException, AbortingException {
boolean hasEvents = preUpdate();
final ThreadState perThread = flushControl.obtainAndLock();
final DocumentsWriterPerThread flushingDWPT;
try {
ensureOpen();
ensureInitialized(perThread);
assert perThread.isInitialized();
final DocumentsWriterPerThread dwpt = perThread.dwpt;
final int dwptNumDocs = dwpt.getNumDocsInRAM();
try {
dwpt.updateDocuments(docs, analyzer, delTerm);
} catch (AbortingException ae) {
} finally {
numDocsInRAM.addAndGet(dwpt.getNumDocsInRAM() - dwptNumDocs);
}
final boolean isUpdate = delTerm != null;
flushingDWPT = flushControl.doAfterDocument(perThread, isUpdate);
} finally {
perThreadPool.release(perThread);
}
return postUpdate(flushingDWPT, hasEvents);
}
updateDocuments first calls preUpdate to deal with data that has not yet been written to disk:
IndexWriter::updateDocuments->DocumentsWriter::updateDocuments->preUpdate
private boolean preUpdate() throws IOException, AbortingException {
ensureOpen();
boolean hasEvents = false;
if (flushControl.anyStalledThreads() || flushControl.numQueuedFlushes() > 0) {
do {
DocumentsWriterPerThread flushingDWPT;
while ((flushingDWPT = flushControl.nextPendingFlush()) != null) {
hasEvents |= doFlush(flushingDWPT);
}
flushControl.waitIfStalled();
} while (flushControl.numQueuedFlushes() != 0);
}
return hasEvents;
}
flushControl is the DocumentsWriterFlushControl created in DocumentsWriter's constructor. preUpdate takes pending DocumentsWriterPerThread instances out of it one by one. In Lucene only a single IndexWriter can hold the file lock and manipulate the index files, yet in practice documents need to be indexed by multiple threads; a DocumentsWriterPerThread represents the indexing state of one such thread. For each DocumentsWriterPerThread obtained, doFlush writes the index data buffered in its memory into the files on disk. The analysis of doFlush is left to a later chapter.
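Since each indexing thread gets its own DocumentsWriterPerThread, a single IndexWriter can safely be driven by many threads. A minimal usage sketch reusing names from the opening example (the thread count is arbitrary; requires java.util.concurrent imports):

ExecutorService pool = Executors.newFixedThreadPool(4);
for (File f : fileDir.listFiles()) {
    pool.submit(() -> {
        // Each submitting thread is paired internally with a ThreadState/DWPT.
        Document doc = new Document();
        doc.add(new StringField("path", f.getPath(), Field.Store.YES));
        indexWriter.addDocument(doc);  // IndexWriter itself is thread-safe
        return null;  // Callable form, so the checked IOException may propagate
    });
}
pool.shutdown();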
Back in DocumentsWriter's updateDocuments: the next step obtains a DocumentsWriterPerThread through DocumentsWriterFlushControl's obtainAndLock function; the DocumentsWriterPerThread is wrapped inside a ThreadState. obtainAndLock looks like this:
IndexWriter::updateDocuments->DocumentsWriter::updateDocuments->DocumentsWriterFlushControl::obtainAndLock
ThreadState obtainAndLock() {
final ThreadState perThread = perThreadPool.getAndLock(Thread
.currentThread(), documentsWriter);
boolean success = false;
try {
if (perThread.isInitialized()
&& perThread.dwpt.deleteQueue != documentsWriter.deleteQueue) {
addFlushableState(perThread);
}
success = true;
return perThread;
} finally {
}
}
The perThreadPool in obtainAndLock is the DocumentsWriterPerThreadPool created in LiveIndexWriterConfig; its getAndLock function is as follows:
IndexWriter::updateDocuments->DocumentsWriter::updateDocuments->DocumentsWriterFlushControl::obtainAndLock->DocumentsWriterPerThreadPool::getAndLock
ThreadState getAndLock(Thread requestingThread, DocumentsWriter documentsWriter) {
ThreadState threadState = null;
synchronized (this) {
if (freeList.isEmpty()) {
return newThreadState();
} else {
threadState = freeList.remove(freeList.size()-1);
if (threadState.dwpt == null) {
for (int i = 0; i < freeList.size(); i++) {
ThreadState ts = freeList.get(i);
if (ts.dwpt != null) {
freeList.set(i, threadState);
threadState = ts;
break;
}
}
}
}
}
threadState.lock();
return threadState;
}
In short, getAndLock works as follows: if freeList is not empty, it takes from it a ThreadState whose dwpt member is non-null; otherwise it creates a new ThreadState. The newThreadState function that does the creating is shown below:
IndexWriter::updateDocuments->DocumentsWriter::updateDocuments->DocumentsWriterFlushControl::obtainAndLock->DocumentsWriterPerThreadPool::getAndLock->newThreadState
private synchronized ThreadState newThreadState() {
while (aborted) {
try {
wait();
} catch (InterruptedException ie) {
throw new ThreadInterruptedException(ie);
}
}
ThreadState threadState = new ThreadState(null);
threadState.lock();
threadStates.add(threadState);
return threadState;
}
newThreadState creates a new ThreadState and adds it to threadStates. ThreadState's constructor is simple and is not examined further here.
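For reference, a ThreadState is essentially a lock paired with a DocumentsWriterPerThread; a simplified sketch (details trimmed, based on the Lucene 6.x source; requires java.util.concurrent.locks.ReentrantLock) looks roughly like this:

// Simplified sketch of DocumentsWriterPerThreadPool.ThreadState: it extends
// ReentrantLock, so the lock()/unlock() calls seen above come straight from
// the JDK, and it carries the DocumentsWriterPerThread that does the work.
static final class ThreadState extends ReentrantLock {
    DocumentsWriterPerThread dwpt;  // null until ensureInitialized assigns one

    ThreadState(DocumentsWriterPerThread dwpt) {
        this.dwpt = dwpt;
    }

    boolean isInitialized() {
        return dwpt != null;
    }
}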
Back in DocumentsWriterFlushControl's obtainAndLock: the dwpt of a freshly created ThreadState is null, so isInitialized returns false and obtainAndLock simply returns the ThreadState just created.
Returning once more to DocumentsWriter's updateDocuments: the ensureInitialized function then initializes the dwpt member of the freshly created ThreadState. It is defined as follows:
IndexWriter::updateDocuments->DocumentsWriter::updateDocuments->ensureInitialized
private void ensureInitialized(ThreadState state) throws IOException {
if (state.dwpt == null) {
final FieldInfos.Builder infos = new FieldInfos.Builder(writer.globalFieldNumberMap);
state.dwpt = new DocumentsWriterPerThread(writer, writer.newSegmentName(), directoryOrig,
directory, config, infoStream, deleteQueue, infos,
writer.pendingNumDocs, writer.enableTestPoints);
}
}
ensureInitialized creates a DocumentsWriterPerThread and assigns it to the ThreadState's dwpt member. DocumentsWriterPerThread's constructor is simple and is not examined further here.
With the ThreadState's dwpt initialized, updateDocuments goes on to call the updateDocuments function of the newly created DocumentsWriterPerThread to index the documents. It is defined as follows:
IndexWriter::updateDocuments->DocumentsWriter::updateDocuments->DocumentsWriterPerThread::updateDocuments
public int updateDocuments(Iterable<? extends Iterable<? extends IndexableField>> docs, Analyzer analyzer, Term delTerm) throws IOException, AbortingException {
docState.analyzer = analyzer;
int docCount = 0;
boolean allDocsIndexed = false;
try {
for (Iterable<? extends IndexableField> doc : docs) {
reserveOneDoc();
docState.doc = doc;
docState.docID = numDocsInRAM;
docCount++;
boolean success = false;
try {
consumer.processDocument();
success = true;
} finally {
}
finishDocument(null);
}
allDocsIndexed = true;
} finally {
}
return docCount;
}
DocumentsWriterPerThread's updateDocuments first calls reserveOneDoc to check whether the number of documents in the index would exceed the limit. The document's information is wrapped in the docState member (a DocState), and processing then continues through consumer's processDocument function; consumer is defined as a DefaultIndexingChain.
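reserveOneDoc itself is tiny; a sketch based on the Lucene 6.x source (the exact limit accessor may vary by version):

// Atomically reserve room for one more document, backing the reservation
// out again if the global document limit would be exceeded.
private void reserveOneDoc() {
    if (pendingNumDocs.incrementAndGet() > IndexWriter.getActualMaxDocs()) {
        pendingNumDocs.decrementAndGet();  // undo the reservation
        throw new IllegalArgumentException("number of documents in the index cannot exceed "
            + IndexWriter.getActualMaxDocs());
    }
}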
DefaultIndexingChain is the default indexing chain. Its processDocument function follows:
IndexWriter::updateDocuments->DocumentsWriter::updateDocuments->DocumentsWriterPerThread::updateDocuments->DefaultIndexingChain::processDocument
public void processDocument() throws IOException, AbortingException {
int fieldCount = 0;
long fieldGen = nextFieldGen++;
termsHash.startDocument();
fillStoredFields(docState.docID);
startStoredFields();
boolean aborting = false;
try {
for (IndexableField field : docState.doc) {
fieldCount = processField(field, fieldGen, fieldCount);
}
} catch (AbortingException ae) {
} finally {
if (aborting == false) {
for (int i = 0; i < fieldCount; i++) {
fields[i].finish();
}
finishStoredFields();
}
}
try {
termsHash.finishDocument();
} catch (Throwable th) {
}
}
The termsHash in DefaultIndexingChain is set to a FreqProxTermsWriter in DefaultIndexingChain's constructor. Its startDocument ultimately calls the startDocument functions of FreqProxTermsWriter and TermVectorsConsumer to do some simple initialization; FreqProxTermsWriter and TermVectorsConsumer keep term and term-vector information in memory, respectively. processDocument then continues its initialization work via fillStoredFields.
DefaultIndexingChain::processDocument->fillStoredFields
private void fillStoredFields(int docID) throws IOException, AbortingException {
while (lastStoredDocID < docID) {
startStoredFields();
finishStoredFields();
}
}
private void startStoredFields() throws IOException, AbortingException {
try {
initStoredFieldsWriter();
storedFieldsWriter.startDocument();
} catch (Throwable th) {
throw AbortingException.wrap(th);
}
lastStoredDocID++;
}
private void finishStoredFields() throws IOException, AbortingException {
try {
storedFieldsWriter.finishDocument();
} catch (Throwable th) {
}
}
The initStoredFieldsWriter function called inside startStoredFields creates a StoredFieldsWriter, which is used to store the values of the document's fields. It is defined as follows:
DefaultIndexingChain::processDocument->fillStoredFields->startStoredFields->initStoredFieldsWriter
private void initStoredFieldsWriter() throws IOException {
if (storedFieldsWriter == null) {
storedFieldsWriter = docWriter.codec.storedFieldsFormat().fieldsWriter(docWriter.directory, docWriter.getSegmentInfo(), IOContext.DEFAULT);
}
}
The docWriter here is the DocumentsWriterPerThread; its codec member is initialized to Lucene60Codec, whose storedFieldsFormat function returns a Lucene50StoredFieldsFormat (the version numbers in these class names are kept for backward compatibility). Lucene50StoredFieldsFormat's fieldsWriter function is as follows:
DefaultIndexingChain::processDocument->fillStoredFields->startStoredFields->initStoredFieldsWriter->Lucene50StoredFieldsFormat::fieldsWriter
public StoredFieldsWriter fieldsWriter(Directory directory, SegmentInfo si, IOContext context) throws IOException {
String previous = si.putAttribute(MODE_KEY, mode.name());
return impl(mode).fieldsWriter(directory, si, context);
}
The default value of mode is BEST_SPEED, for which impl returns a CompressingStoredFieldsFormat instance; that format's fieldsWriter function in turn creates and returns a CompressingStoredFieldsWriter. Here is a brief look at CompressingStoredFieldsWriter's constructor:
public CompressingStoredFieldsWriter(Directory directory, SegmentInfo si, String segmentSuffix, IOContext context,
String formatName, CompressionMode compressionMode, int chunkSize, int maxDocsPerChunk, int blockSize) throws IOException {
assert directory != null;
this.segment = si.name;
this.compressionMode = compressionMode;
this.compressor = compressionMode.newCompressor();
this.chunkSize = chunkSize;
this.maxDocsPerChunk = maxDocsPerChunk;
this.docBase = 0;
this.bufferedDocs = new GrowableByteArrayDataOutput(chunkSize);
this.numStoredFields = new int[16];
this.endOffsets = new int[16];
this.numBufferedDocs = 0;
boolean success = false;
IndexOutput indexStream = directory.createOutput(IndexFileNames.segmentFileName(segment, segmentSuffix, FIELDS_INDEX_EXTENSION),
context);
try {
fieldsStream = directory.createOutput(IndexFileNames.segmentFileName(segment, segmentSuffix, FIELDS_EXTENSION),context);
final String codecNameIdx = formatName + CODEC_SFX_IDX;
final String codecNameDat = formatName + CODEC_SFX_DAT;
CodecUtil.writeIndexHeader(indexStream, codecNameIdx, VERSION_CURRENT, si.getId(), segmentSuffix);
CodecUtil.writeIndexHeader(fieldsStream, codecNameDat, VERSION_CURRENT, si.getId(), segmentSuffix);
indexWriter = new CompressingStoredFieldsIndexWriter(indexStream, blockSize);
indexStream = null;
fieldsStream.writeVInt(chunkSize);
fieldsStream.writeVInt(PackedInts.VERSION_CURRENT);
success = true;
} finally {
}
}
Put simply, the CompressingStoredFieldsWriter constructor creates the corresponding .fdt and .fdx files and writes the appropriate header information into them.
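As an illustration (the segment name "_0" is just an assumed example): FIELDS_EXTENSION is "fdt", FIELDS_INDEX_EXTENSION is "fdx", and the file names are built by IndexFileNames.segmentFileName:

// Hypothetical example: for a segment named "_0" with an empty suffix, the
// writer creates "_0.fdt" (stored field data) and "_0.fdx" (its index).
String dataFile  = IndexFileNames.segmentFileName("_0", "", "fdt");  // "_0.fdt"
String indexFile = IndexFileNames.segmentFileName("_0", "", "fdx");  // "_0.fdx"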
Back in DefaultIndexingChain's startStoredFields: once the CompressingStoredFieldsWriter has been constructed, its startDocument function is called, which is empty. Looking back at finishStoredFields: it performs some bookkeeping through the newly constructed CompressingStoredFieldsWriter's finishDocument function and, when the trigger condition is met, writes the in-memory index data to the files on disk via flush. The source of flush is analyzed in a later chapter.
DefaultIndexingChain::processDocument->fillStoredFields->finishStoredFields->CompressingStoredFieldsWriter::finishDocument
public void finishDocument() throws IOException {
if (numBufferedDocs == this.numStoredFields.length) {
final int newLength = ArrayUtil.oversize(numBufferedDocs + 1, 4);
this.numStoredFields = Arrays.copyOf(this.numStoredFields, newLength);
endOffsets = Arrays.copyOf(endOffsets, newLength);
}
this.numStoredFields[numBufferedDocs] = numStoredFieldsInDoc;
numStoredFieldsInDoc = 0;
endOffsets[numBufferedDocs] = bufferedDocs.length;
++numBufferedDocs;
if (triggerFlush()) {
flush();
}
}
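The triggerFlush condition checked above is simple; a sketch consistent with the constructor parameters chunkSize and maxDocsPerChunk seen earlier (details may differ slightly between versions):

// Flush one compressed chunk once enough bytes or documents are buffered.
private boolean triggerFlush() {
    return bufferedDocs.length >= chunkSize ||   // enough buffered bytes
           numBufferedDocs >= maxDocsPerChunk;   // enough buffered documents
}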
Back in DefaultIndexingChain's processDocument: the next step iterates over the document's fields, handling each one through the processField function, defined as follows:
DefaultIndexingChain::processDocument->processField
private int processField(IndexableField field, long fieldGen, int fieldCount) throws IOException, AbortingException {
String fieldName = field.name();
IndexableFieldType fieldType = field.fieldType();
PerField fp = null;
if (fieldType.indexOptions() != IndexOptions.NONE) {
fp = getOrAddField(fieldName, fieldType, true);
boolean first = fp.fieldGen != fieldGen;
fp.invert(field, first);
if (first) {
fields[fieldCount++] = fp;
fp.fieldGen = fieldGen;
}
} else {
verifyUnIndexedFieldType(fieldName, fieldType);
}
if (fieldType.stored()) {
if (fp == null) {
fp = getOrAddField(fieldName, fieldType, false);
}
if (fieldType.stored()) {
try {
storedFieldsWriter.writeField(fp.fieldInfo, field);
} catch (Throwable th) {
throw AbortingException.wrap(th);
}
}
}
return fieldCount;
}
The getOrAddField function used in processField creates or looks up a PerField keyed on the field's name:
DefaultIndexingChain::processDocument->processField->getOrAddField
private PerField getOrAddField(String name, IndexableFieldType fieldType, boolean invert) {
final int hashPos = name.hashCode() & hashMask;
PerField fp = fieldHash[hashPos];
while (fp != null && !fp.fieldInfo.name.equals(name)) {
fp = fp.next;
}
if (fp == null) {
FieldInfo fi = fieldInfos.getOrAdd(name);
fi.setIndexOptions(fieldType.indexOptions());
fp = new PerField(fi, invert);
fp.next = fieldHash[hashPos];
fieldHash[hashPos] = fp;
totalFieldCount++;
if (totalFieldCount >= fieldHash.length/2) {
rehash();
}
if (totalFieldCount > fields.length) {
PerField[] newFields = new PerField[ArrayUtil.oversize(totalFieldCount, RamUsageEstimator.NUM_BYTES_OBJECT_REF)];
System.arraycopy(fields, 0, newFields, 0, fields.length);
fields = newFields;
}
} else if (invert && fp.invertState == null) {
fp.fieldInfo.setIndexOptions(fieldType.indexOptions());
fp.setInvertState();
}
return fp;
}
The member fieldHash is a hash table used to store PerField instances; the FieldInfo inside a PerField records the field's name and related information. A new PerField is chained into the appropriate slot of fieldHash, and getOrAddField grows the hash table when it becomes too full; finally it returns a reference to the PerField that was found or newly created in the table.
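The slot computation at the top of getOrAddField is the usual power-of-two mask trick: hashMask is assumed to be fieldHash.length - 1, so the slot is just the low bits of the name's hashCode. For example:

// With a table of length 16, hashMask is 15, and any name maps to [0, 15].
int hashMask = 16 - 1;
int hashPos = "body".hashCode() & hashMask;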
Back in processField: next, the invert method invokes the analyzer to analyze the field. That code is fairly involved and is left to the next chapter.
Continuing in processField, suppose fieldType is TYPE_STORED, whose stored function returns true, indicating that the field's value needs to be stored; StoredFieldsWriter's writeField is therefore called to save the value. As analyzed earlier in this chapter, storedFieldsWriter here is a CompressingStoredFieldsWriter, and its writeField function looks like this:
DefaultIndexingChain::processDocument->processField->CompressingStoredFieldsWriter::writeField
public void writeField(FieldInfo info, IndexableField field)
throws IOException {
...
string = field.stringValue();
...
if (bytes != null) {
bufferedDocs.writeVInt(bytes.length);
bufferedDocs.writeBytes(bytes.bytes, bytes.offset, bytes.length);
} else if (string != null) {
bufferedDocs.writeString(string);
} else {
if (number instanceof Byte || number instanceof Short || number instanceof Integer) {
bufferedDocs.writeZInt(number.intValue());
} else if (number instanceof Long) {
writeTLong(bufferedDocs, number.longValue());
} else if (number instanceof Float) {
writeZFloat(bufferedDocs, number.floatValue());
} else if (number instanceof Double) {
writeZDouble(bufferedDocs, number.doubleValue());
} else {
throw new AssertionError("Cannot get here");
}
}
}
Assuming the field value to be stored is of type String, the call ends up in bufferedDocs' writeString function. bufferedDocs is set in CompressingStoredFieldsWriter's constructor to a GrowableByteArrayDataOutput, whose writeString simply appends the string's bytes to a buffer in memory; at the appropriate moment, flush writes this buffered data into the .fdt file.
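A small illustration of this buffering (the public bytes/length members match the bufferedDocs.length usage in finishDocument above):

// Sketch: GrowableByteArrayDataOutput is a DataOutput over a growable byte[].
// writeString only appends to the in-memory buffer; nothing reaches disk
// until the writer's flush copies the buffer into the .fdt file.
GrowableByteArrayDataOutput buffer = new GrowableByteArrayDataOutput(1 << 14);
buffer.writeString("some stored field value");
int buffered = buffer.length;  // number of bytes currently buffered in memory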
Back in DefaultIndexingChain's processDocument: the code then walks the fields array, takes out each PerField created earlier in processField, and calls its finish function; that function goes fairly deep but contains nothing essential here, so it is skipped. processDocument finally calls finishStoredFields, already analyzed above, which mainly triggers a flush when necessary.
Back in DocumentsWriterPerThread's updateDocuments: the DocState just used is cleared, and finishDocument is called to handle data pending deletion; that function is set aside for now.
Going back up to DocumentsWriter's updateDocuments: numDocsInRAM then records how many documents have been indexed so far, and DocumentsWriterFlushControl's doAfterDocument is called for further processing, which this chapter leaves aside. updateDocuments then calls release to free the ThreadState obtained at the start, returning it to the freeList for later callers. Finally, postUpdate selects the appropriate DocumentsWriterPerThread and calls its doFlush function to write index data into the files on disk:
IndexWriter::updateDocuments->DocumentsWriter::updateDocuments->postUpdate
private boolean postUpdate(DocumentsWriterPerThread flushingDWPT, boolean hasEvents) throws IOException, AbortingException {
hasEvents |= applyAllDeletes(deleteQueue);
if (flushingDWPT != null) {
hasEvents |= doFlush(flushingDWPT);
} else {
final DocumentsWriterPerThread nextPendingFlush = flushControl.nextPendingFlush();
if (nextPendingFlush != null) {
hasEvents |= doFlush(nextPendingFlush);
}
}
return hasEvents;
}
Going back up to IndexWriter's updateDocuments: processEvents is then called to fire any pending events:
IndexWriter::updateDocuments->processEvents
private boolean processEvents(boolean triggerMerge, boolean forcePurge) throws IOException {
return processEvents(eventQueue, triggerMerge, forcePurge);
}
private boolean processEvents(Queue<Event> queue, boolean triggerMerge, boolean forcePurge) throws IOException {
boolean processed = false;
if (tragedy == null) {
Event event;
while((event = queue.poll()) != null) {
processed = true;
event.process(this, triggerMerge, forcePurge);
}
}
return processed;
}
processEvents polls Event objects off the queue one by one and calls each one's process function to handle it.
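For completeness, Event is a small callback interface nested in IndexWriter; its shape (per the Lucene 6.x source, simplified) is roughly:

// Deferred work (applying deletes, kicking off merges, ...) is queued as
// Events during indexing and drained here by processEvents.
interface Event {
    void process(IndexWriter writer, boolean triggerMerge, boolean clearBuffers) throws IOException;
}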