Lucene Source Code Analysis (3)

Lucene source code analysis: adding documents

For ease of reference, here again is the index-creation example from chapter 1:

            String filePath = ...   // directory containing the files to index
            String indexPath = ...  // directory in which to store the index
            File fileDir = new File(filePath);
            Directory dir = FSDirectory.open(Paths.get(indexPath));

            Analyzer luceneAnalyzer = new SimpleAnalyzer();
            IndexWriterConfig iwc = new IndexWriterConfig(luceneAnalyzer);
            iwc.setOpenMode(OpenMode.CREATE);   // always create a fresh index
            IndexWriter indexWriter = new IndexWriter(dir, iwc);
            File[] textFiles = fileDir.listFiles();

            for (int i = 0; i < textFiles.length; i++) {    
                if (textFiles[i].isFile()) {     
                    String temp = FileReaderAll(textFiles[i].getCanonicalPath(),    
                            "GBK");    
                    Document document = new Document();    
                    Field FieldPath = new StringField("path", textFiles[i].getPath(), Field.Store.YES);
                    Field FieldBody = new TextField("body", temp, Field.Store.YES);    
                    document.add(FieldPath);    
                    document.add(FieldBody);    
                    indexWriter.addDocument(document);    
                }    
            }    
            indexWriter.close();

In this code, the FileReaderAll function reads the contents of a file into a String, using "GBK" as the encoding. After the all-important IndexWriter has been created, the code iterates over the files to be indexed, builds the corresponding Document and Field objects, and finally hands each document to IndexWriter's addDocument function to start indexing.
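FileReaderAll is the helper from chapter 1 rather than a Lucene API; a minimal sketch of my own, assuming it simply reads the whole file with the given charset, might look like this:

            // Hypothetical helper (not part of Lucene): read an entire file
            // into a String using the given charset, e.g. "GBK".
            private static String FileReaderAll(String fileName, String charset) throws IOException {
                StringBuilder sb = new StringBuilder();
                try (BufferedReader reader = new BufferedReader(
                        new InputStreamReader(new FileInputStream(fileName), charset))) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        sb.append(line).append('\n');
                    }
                }
                return sb.toString();
            }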
Document's constructor is empty, and the constructors of StringField, TextField and Field are also simple. The rest of this chapter focuses on IndexWriter's addDocument function, whose code follows.

addDocument

The addDocument function not only adds the document's data but also builds the index for it. Let's take a look.
IndexWriter::addDocument

  public void addDocument(Iterable<? extends IndexableField> doc) throws IOException {
    updateDocument(null, doc);
  }

  public void updateDocuments(Term delTerm, Iterable<? extends Iterable<? extends IndexableField>> docs) throws IOException {
    ensureOpen();
    try {
      boolean success = false;
      try {
        if (docWriter.updateDocuments(docs, analyzer, delTerm)) {
          processEvents(true, false);
        }
        success = true;
      } finally {
        // cleanup elided
      }
    } catch (AbortingException | VirtualMachineError tragedy) {
      // error handling elided
    }
  }

addDocument delegates to updateDocument, whose body is the single-document twin of the updateDocuments function shown above; since the two are parallel, this chapter follows the updateDocuments path. The delTerm parameter is used by updates and deletes; it is null here, so we can ignore it. updateDocuments first calls ensureOpen to make sure the IndexWriter has not been closed, and then calls DocumentsWriter's updateDocuments function, shown below.
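As a usage note of my own (not from the original post): when delTerm is non-null, Lucene first deletes every document matching the term and then adds the new document atomically, which is how updates are expressed:

            // adding: delTerm is null internally, nothing is deleted
            indexWriter.addDocument(document);

            // updating: atomically delete any document whose "path" field
            // matches, then add the new version
            indexWriter.updateDocument(new Term("path", textFiles[i].getPath()), document);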
IndexWriter::updateDocuments->DocumentsWriter::updateDocuments

  boolean updateDocuments(final Iterable<? extends Iterable<? extends IndexableField>> docs, final Analyzer analyzer, final Term delTerm) throws IOException, AbortingException {

    boolean hasEvents = preUpdate();

    final ThreadState perThread = flushControl.obtainAndLock();
    final DocumentsWriterPerThread flushingDWPT;

    try {
      ensureOpen();
      ensureInitialized(perThread);
      assert perThread.isInitialized();
      final DocumentsWriterPerThread dwpt = perThread.dwpt;
      final int dwptNumDocs = dwpt.getNumDocsInRAM();
      try {
        dwpt.updateDocuments(docs, analyzer, delTerm);
      } catch (AbortingException ae) {

      } finally {
        numDocsInRAM.addAndGet(dwpt.getNumDocsInRAM() - dwptNumDocs);
      }
      final boolean isUpdate = delTerm != null;
      flushingDWPT = flushControl.doAfterDocument(perThread, isUpdate);
    } finally {
      perThreadPool.release(perThread);
    }

    return postUpdate(flushingDWPT, hasEvents);
  }

updateDocuments first calls the preUpdate function to deal with data that has not yet been written to disk; its code is shown below.
IndexWriter::updateDocuments->DocumentsWriter::updateDocuments->preUpdate

  private boolean preUpdate() throws IOException, AbortingException {
    ensureOpen();
    boolean hasEvents = false;
    if (flushControl.anyStalledThreads() || flushControl.numQueuedFlushes() > 0) {
      do {
        DocumentsWriterPerThread flushingDWPT;
        while ((flushingDWPT = flushControl.nextPendingFlush()) != null) {
          hasEvents |= doFlush(flushingDWPT);
        }

        flushControl.waitIfStalled();
      } while (flushControl.numQueuedFlushes() != 0);

    }
    return hasEvents;
  }

flushControl is the DocumentsWriterFlushControl created in DocumentsWriter's constructor. The preUpdate function takes pending DocumentsWriterPerThread instances out of the DocumentsWriterFlushControl one by one. In Lucene only a single IndexWriter can hold the file lock and manipulate the index files, yet in practice documents are indexed by multiple threads, and a DocumentsWriterPerThread represents one such document-indexing thread. For each DocumentsWriterPerThread obtained, doFlush writes that DocumentsWriterPerThread's in-memory index data into the files on disk. The analysis of doFlush is left to a later chapter.
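To make the threading model concrete, here is a sketch of my own (not from the original post) of the usual pattern: many application threads share one IndexWriter, and Lucene binds each concurrent caller to its own DocumentsWriterPerThread internally. The documents collection is an assumed placeholder:

            ExecutorService pool = Executors.newFixedThreadPool(4);
            for (Document doc : documents) {          // 'documents' is an assumed collection
                pool.submit(() -> {
                    try {
                        // safe from multiple threads: each caller is bound
                        // to its own DocumentsWriterPerThread internally
                        indexWriter.addDocument(doc);
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                });
            }
            pool.shutdown();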

Returning to DocumentsWriter's updateDocuments function: the next step obtains a DocumentsWriterPerThread through DocumentsWriterFlushControl's obtainAndLock function. The DocumentsWriterPerThread is wrapped in a ThreadState; the code of obtainAndLock is shown below.

IndexWriter::updateDocuments->DocumentsWriter::updateDocuments->DocumentsWriterFlushControl::obtainAndLock

  ThreadState obtainAndLock() {
    final ThreadState perThread = perThreadPool.getAndLock(Thread.currentThread(), documentsWriter);
    boolean success = false;
    try {
      if (perThread.isInitialized()
          && perThread.dwpt.deleteQueue != documentsWriter.deleteQueue) {
        addFlushableState(perThread);
      }
      success = true;
      return perThread;
    } finally {

    }
  }

The perThreadPool in obtainAndLock is the DocumentsWriterPerThreadPool created in LiveIndexWriterConfig; its getAndLock function is shown below.
IndexWriter::updateDocuments->DocumentsWriter::updateDocuments->DocumentsWriterFlushControl::obtainAndLock->DocumentsWriterPerThreadPool::getAndLock

  ThreadState getAndLock(Thread requestingThread, DocumentsWriter documentsWriter) {
    ThreadState threadState = null;
    synchronized (this) {
      if (freeList.isEmpty()) {
        return newThreadState();
      } else {
        threadState = freeList.remove(freeList.size()-1);
        if (threadState.dwpt == null) {
          // steal an already-initialized ThreadState from the free list, if any
          for (int i = 0; i < freeList.size(); i++) {
            ThreadState ts = freeList.get(i);
            if (ts.dwpt != null) {
              // swap the initialized one with the uninitialized one
              freeList.set(i, threadState);
              threadState = ts;
              break;
            }
          }
        }
      }
    }
    threadState.lock();
    return threadState;
  }

In short, getAndLock creates a new ThreadState if freeList is empty; otherwise it takes a ThreadState from freeList, preferring one whose dwpt member is non-null. The newThreadState function that creates a ThreadState is shown below.
IndexWriter::updateDocuments->DocumentsWriter::updateDocuments->DocumentsWriterFlushControl::obtainAndLock->DocumentsWriterPerThreadPool::getAndLock->newThreadState

  private synchronized ThreadState newThreadState() {
    while (aborted) {
      try {
        wait();
      } catch (InterruptedException ie) {
        throw new ThreadInterruptedException(ie);        
      }
    }
    ThreadState threadState = new ThreadState(null);
    threadState.lock();
    threadStates.add(threadState);
    return threadState;
  }

newThreadState creates a new ThreadState and adds it to threadStates. ThreadState's constructor is trivial, so we will not follow it further.

Back in DocumentsWriterFlushControl's obtainAndLock function: the dwpt of a freshly created ThreadState is null, so isInitialized returns false and obtainAndLock simply returns the ThreadState it just created.

Back in DocumentsWriter's updateDocuments function: the ensureInitialized function is then used to initialize the dwpt member of the newly created ThreadState; its definition is shown below.
IndexWriter::updateDocuments->DocumentsWriter::updateDocuments->ensureInitialized

  private void ensureInitialized(ThreadState state) throws IOException {
    if (state.dwpt == null) {
      final FieldInfos.Builder infos = new FieldInfos.Builder(writer.globalFieldNumberMap);
      state.dwpt = new DocumentsWriterPerThread(writer, writer.newSegmentName(), directoryOrig,
                                                directory, config, infoStream, deleteQueue, infos,
                                                writer.pendingNumDocs, writer.enableTestPoints);
    }
  }

ensureInitialized creates a DocumentsWriterPerThread and assigns it to the ThreadState's dwpt member. DocumentsWriterPerThread's constructor is straightforward, so we will not go into it.

Once the ThreadState's dwpt has been initialized, updateDocuments goes on to call the updateDocuments function of the newly created DocumentsWriterPerThread to index the documents; its definition is shown below.
IndexWriter::updateDocuments->DocumentsWriter::updateDocuments->DocumentsWriterPerThread::updateDocuments

  public int updateDocuments(Iterable<? extends Iterable<? extends IndexableField>> docs, Analyzer analyzer, Term delTerm) throws IOException, AbortingException {
    docState.analyzer = analyzer;
    int docCount = 0;
    boolean allDocsIndexed = false;
    try {

      for (Iterable<? extends IndexableField> doc : docs) {
        reserveOneDoc();
        docState.doc = doc;
        docState.docID = numDocsInRAM;
        docCount++;

        boolean success = false;
        try {
          consumer.processDocument();
          success = true;
        } finally {

        }
        finishDocument(null);
      }
      allDocsIndexed = true;

    } finally {

    }

    return docCount;
  }

DocumentsWriterPerThread's updateDocuments function first calls reserveOneDoc to check whether the number of documents in the index would exceed the limit. The document's state is wrapped in the docState member, and processing then continues in consumer's processDocument function, where consumer is a DefaultIndexingChain.
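reserveOneDoc itself is short; lightly simplified from this Lucene version, it atomically reserves a slot in the pending document count and backs out if the global limit would be exceeded:

  private void reserveOneDoc() {
    if (pendingNumDocs.incrementAndGet() > IndexWriter.getActualMaxDocs()) {
      // undo the reservation before failing
      pendingNumDocs.decrementAndGet();
      throw new IllegalArgumentException("number of documents in the index cannot exceed "
          + IndexWriter.getActualMaxDocs());
    }
  }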

The processDocument function of DefaultIndexingChain

DefaultIndexingChain is the default indexing chain; let's look at its processDocument function.
IndexWriter::updateDocuments->DocumentsWriter::updateDocuments->DocumentsWriterPerThread::updateDocuments->DefaultIndexingChain::processDocument

  public void processDocument() throws IOException, AbortingException {

    int fieldCount = 0;
    long fieldGen = nextFieldGen++;
    termsHash.startDocument();
    fillStoredFields(docState.docID);
    startStoredFields();

    boolean aborting = false;
    try {
      for (IndexableField field : docState.doc) {
        fieldCount = processField(field, fieldGen, fieldCount);
      }
    } catch (AbortingException ae) {

    } finally {
      if (aborting == false) {
        // finish each indexed field seen in this document
        for (int i = 0; i < fieldCount; i++) {
          fields[i].finish();
        }
      }
    }

    try {
      termsHash.finishDocument();
    } catch (Throwable th) {

    }
  }

The termsHash in DefaultIndexingChain is set to a FreqProxTermsWriter in DefaultIndexingChain's constructor. Its startDocument ultimately calls the startDocument functions of both FreqProxTermsWriter and TermVectorsConsumer, which do some simple initialization; FreqProxTermsWriter and TermVectorsConsumer hold the in-memory information for terms and term vectors respectively. DefaultIndexingChain's processDocument function then calls fillStoredFields to carry out further initialization.
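For reference, the wiring just described happens in DefaultIndexingChain's constructor, roughly as follows (my simplification of the source): the postings writer wraps the term-vectors writer, so finishing a document flows through both.

    // simplified from DefaultIndexingChain's constructor:
    TermsHash termVectorsWriter = new TermVectorsConsumer(docWriter);
    termsHash = new FreqProxTermsWriter(docWriter, termVectorsWriter);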
DefaultIndexingChain::processDocument->fillStoredFields

  private void fillStoredFields(int docID) throws IOException, AbortingException {
    while (lastStoredDocID < docID) {
      startStoredFields();
      finishStoredFields();
    }
  }

  private void startStoredFields() throws IOException, AbortingException {
    try {
      initStoredFieldsWriter();
      storedFieldsWriter.startDocument();
    } catch (Throwable th) {
      throw AbortingException.wrap(th);
    }
    lastStoredDocID++;
  }

  private void finishStoredFields() throws IOException, AbortingException {
    try {
      storedFieldsWriter.finishDocument();
    } catch (Throwable th) {

    }
  }

The initStoredFieldsWriter function inside startStoredFields creates a StoredFieldsWriter, which is used to store the values of the document's fields; its definition is shown below.
DefaultIndexingChain::processDocument->fillStoredFields->startStoredFields->initStoredFieldsWriter

  private void initStoredFieldsWriter() throws IOException {
    if (storedFieldsWriter == null) {
      storedFieldsWriter = docWriter.codec.storedFieldsFormat().fieldsWriter(docWriter.directory, docWriter.getSegmentInfo(), IOContext.DEFAULT);
    }
  }

The docWriter here is the DocumentsWriterPerThread; its codec member is initialized to Lucene60Codec, whose storedFieldsFormat function returns a Lucene50StoredFieldsFormat (the version numbers in these class names are kept for format compatibility). Lucene50StoredFieldsFormat's fieldsWriter function is shown below.
DefaultIndexingChain::processDocument->fillStoredFields->startStoredFields->initStoredFieldsWriter->Lucene50StoredFieldsFormat::fieldsWriter

  public StoredFieldsWriter fieldsWriter(Directory directory, SegmentInfo si, IOContext context) throws IOException {
    String previous = si.putAttribute(MODE_KEY, mode.name());
    return impl(mode).fieldsWriter(directory, si, context);
  }

The default value of mode is BEST_SPEED, for which impl returns a CompressingStoredFieldsFormat instance. CompressingStoredFieldsFormat's fieldsWriter function in turn creates and returns a CompressingStoredFieldsWriter, whose constructor we will look at in a moment.
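First, a usage note of my own (assuming the Lucene 6.x codec classes named above): the mode is fixed when the codec is constructed, so an application that prefers smaller stored-fields files can opt into BEST_COMPRESSION through IndexWriterConfig:

    // Trade some indexing speed for smaller .fdt files; assumes Lucene 6.x,
    // where Lucene60Codec takes a Lucene50StoredFieldsFormat.Mode.
    IndexWriterConfig config = new IndexWriterConfig(new SimpleAnalyzer());
    config.setCodec(new Lucene60Codec(Lucene50StoredFieldsFormat.Mode.BEST_COMPRESSION));
    IndexWriter writer = new IndexWriter(dir, config);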

  public CompressingStoredFieldsWriter(Directory directory, SegmentInfo si, String segmentSuffix, IOContext context,
      String formatName, CompressionMode compressionMode, int chunkSize, int maxDocsPerChunk, int blockSize) throws IOException {
    assert directory != null;
    this.segment = si.name;
    this.compressionMode = compressionMode;
    this.compressor = compressionMode.newCompressor();
    this.chunkSize = chunkSize;
    this.maxDocsPerChunk = maxDocsPerChunk;
    this.docBase = 0;
    this.bufferedDocs = new GrowableByteArrayDataOutput(chunkSize);
    this.numStoredFields = new int[16];
    this.endOffsets = new int[16];
    this.numBufferedDocs = 0;

    boolean success = false;
    IndexOutput indexStream = directory.createOutput(IndexFileNames.segmentFileName(segment, segmentSuffix, FIELDS_INDEX_EXTENSION), 
                                                                     context);
    try {
      fieldsStream = directory.createOutput(IndexFileNames.segmentFileName(segment, segmentSuffix, FIELDS_EXTENSION),context);

      final String codecNameIdx = formatName + CODEC_SFX_IDX;
      final String codecNameDat = formatName + CODEC_SFX_DAT;
      CodecUtil.writeIndexHeader(indexStream, codecNameIdx, VERSION_CURRENT, si.getId(), segmentSuffix);
      CodecUtil.writeIndexHeader(fieldsStream, codecNameDat, VERSION_CURRENT, si.getId(), segmentSuffix);

      indexWriter = new CompressingStoredFieldsIndexWriter(indexStream, blockSize);
      indexStream = null;

      fieldsStream.writeVInt(chunkSize);
      fieldsStream.writeVInt(PackedInts.VERSION_CURRENT);

      success = true;
    } finally {

    }
  }

In short, the CompressingStoredFieldsWriter constructor creates the segment's .fdt and .fdx files and writes the appropriate header into each.
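As a quick check of my own (not from the original post), these files can be observed by listing the Directory after documents have been added:

    // The stored-fields data (.fdt) and index (.fdx) files of the
    // in-flight segment show up in the directory listing.
    for (String file : dir.listAll()) {
        if (file.endsWith(".fdt") || file.endsWith(".fdx")) {
            System.out.println(file);   // e.g. _0.fdt, _0.fdx
        }
    }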

Back in DefaultIndexingChain's startStoredFields function: after the CompressingStoredFieldsWriter has been constructed, its startDocument function is called, which is empty. Looking back at finishStoredFields, that function calls the finishDocument function of the CompressingStoredFieldsWriter we just constructed to do some bookkeeping and, when the trigger condition is met, calls flush to write the in-memory index data into the files on disk; the flush source is analyzed in a later chapter.
DefaultIndexingChain::processDocument->fillStoredFields->finishStoredFields->CompressingStoredFieldsWriter::finishDocument

  public void finishDocument() throws IOException {
    if (numBufferedDocs == this.numStoredFields.length) {
      final int newLength = ArrayUtil.oversize(numBufferedDocs + 1, 4);
      this.numStoredFields = Arrays.copyOf(this.numStoredFields, newLength);
      endOffsets = Arrays.copyOf(endOffsets, newLength);
    }
    this.numStoredFields[numBufferedDocs] = numStoredFieldsInDoc;
    numStoredFieldsInDoc = 0;
    endOffsets[numBufferedDocs] = bufferedDocs.length;
    ++numBufferedDocs;
    if (triggerFlush()) {
      flush();
    }
  }
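For reference, the triggerFlush condition checked above is short; in this version it roughly compares the buffered bytes and the buffered document count against the per-chunk limits (lightly simplified):

  private boolean triggerFlush() {
    // flush a chunk once enough bytes or enough documents have been buffered
    return bufferedDocs.length >= chunkSize ||
        numBufferedDocs >= maxDocsPerChunk;
  }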

Back in DefaultIndexingChain's processDocument function: the code next iterates over the document's fields and processes each one through the processField function, whose definition is shown below.
DefaultIndexingChain::processDocument->processField

  private int processField(IndexableField field, long fieldGen, int fieldCount) throws IOException, AbortingException {
    String fieldName = field.name();
    IndexableFieldType fieldType = field.fieldType();

    PerField fp = null;
    if (fieldType.indexOptions() != IndexOptions.NONE) {

      fp = getOrAddField(fieldName, fieldType, true);
      boolean first = fp.fieldGen != fieldGen;
      fp.invert(field, first);

      if (first) {
        fields[fieldCount++] = fp;
        fp.fieldGen = fieldGen;
      }
    } else {
      verifyUnIndexedFieldType(fieldName, fieldType);
    }

    if (fieldType.stored()) {
      if (fp == null) {
        fp = getOrAddField(fieldName, fieldType, false);
      }
      if (fieldType.stored()) {
        try {
          storedFieldsWriter.writeField(fp.fieldInfo, field);
        } catch (Throwable th) {
          throw AbortingException.wrap(th);
        }
      }
    }

    return fieldCount;
  }

The getOrAddField function in processField creates or retrieves a PerField keyed by the field's name; its code is shown below.
DefaultIndexingChain::processDocument->processField->getOrAddField

  private PerField getOrAddField(String name, IndexableFieldType fieldType, boolean invert) {

    final int hashPos = name.hashCode() & hashMask;
    PerField fp = fieldHash[hashPos];
    while (fp != null && !fp.fieldInfo.name.equals(name)) {
      fp = fp.next;
    }

    if (fp == null) {
      FieldInfo fi = fieldInfos.getOrAdd(name);
      fi.setIndexOptions(fieldType.indexOptions());

      fp = new PerField(fi, invert);
      fp.next = fieldHash[hashPos];
      fieldHash[hashPos] = fp;
      totalFieldCount++;

      if (totalFieldCount >= fieldHash.length/2) {
        rehash();
      }

      if (totalFieldCount > fields.length) {
        PerField[] newFields = new PerField[ArrayUtil.oversize(totalFieldCount, RamUsageEstimator.NUM_BYTES_OBJECT_REF)];
        System.arraycopy(fields, 0, newFields, 0, fields.length);
        fields = newFields;
      }

    } else if (invert && fp.invertState == null) {
      fp.fieldInfo.setIndexOptions(fieldType.indexOptions());
      fp.setInvertState();
    }

    return fp;
  }

The fieldHash member is a hash table of buckets storing PerField objects. A PerField's FieldInfo records the field's name and related metadata; the new PerField is linked into the appropriate bucket, and getOrAddField enlarges the table when it grows too full. The function finally returns the PerField found in, or newly linked into, the hash table.
Back in processField: the invert method then runs the analyzer over the field. That code is fairly involved and is left for the next chapter.

Back in processField again: suppose fieldType is TYPE_STORED. TYPE_STORED's stored function returns true, meaning the field's value must be stored, so StoredFieldsWriter's writeField function is called to save it. As analyzed earlier in this chapter, storedFieldsWriter here is a CompressingStoredFieldsWriter, whose writeField function is shown below.
DefaultIndexingChain::processDocument->processField->CompressingStoredFieldsWriter::writeField

  public void writeField(FieldInfo info, IndexableField field)
      throws IOException {

    ...

    string = field.stringValue();

    ...

    if (bytes != null) {
      bufferedDocs.writeVInt(bytes.length);
      bufferedDocs.writeBytes(bytes.bytes, bytes.offset, bytes.length);
    } else if (string != null) {
      bufferedDocs.writeString(string);
    } else {
      if (number instanceof Byte || number instanceof Short || number instanceof Integer) {
        bufferedDocs.writeZInt(number.intValue());
      } else if (number instanceof Long) {
        writeTLong(bufferedDocs, number.longValue());
      } else if (number instanceof Float) {
        writeZFloat(bufferedDocs, number.floatValue());
      } else if (number instanceof Double) {
        writeZDouble(bufferedDocs, number.doubleValue());
      } else {
        throw new AssertionError("Cannot get here");
      }
    }
  }

Suppose the field value to be stored is a String; writeField then ends up calling bufferedDocs' writeString function. bufferedDocs is set to a GrowableByteArrayDataOutput in CompressingStoredFieldsWriter's constructor, and its writeString function simply buffers the string in an in-memory byte array; at the appropriate moment, flush writes this data into the .fdt file.
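As a small illustration of my own (assuming the 6.x version of this class, with its public bytes and length fields), GrowableByteArrayDataOutput simply appends to a growing byte[]:

    // writeString buffers a VInt length prefix followed by the UTF-8 bytes.
    GrowableByteArrayDataOutput out = new GrowableByteArrayDataOutput(128);
    out.writeString("hello");
    System.out.println(out.length);   // number of bytes buffered so far
    // out.bytes holds the raw buffer; CompressingStoredFieldsWriter's flush
    // later compresses it and writes it into the .fdt file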

Back in DefaultIndexingChain's processDocument function: the code then iterates over fields, takes each PerField created earlier in processField, and calls its finish function; that function runs deep but contains nothing essential for us, so we will not follow it. processDocument finally calls finishStoredFields, analyzed above, whose main job is to trigger a flush when necessary.

Back in DocumentsWriterPerThread's updateDocuments function: the DocState just filled in is then cleared, and finishDocument is called to handle any pending deletes; we set that function aside for now.

Moving back up to DocumentsWriter's updateDocuments function: numDocsInRAM then records how many documents have been indexed so far, after which DocumentsWriterFlushControl's doAfterDocument function continues the processing; this chapter leaves it aside. updateDocuments then calls release on the ThreadState obtained at the start, returning it to the freeList for later callers. Finally, postUpdate selects a suitable DocumentsWriterPerThread and calls its doFlush function to write index data into the files on disk.
IndexWriter::updateDocuments->DocumentsWriter::updateDocuments->postUpdate

  private boolean postUpdate(DocumentsWriterPerThread flushingDWPT, boolean hasEvents) throws IOException, AbortingException {
    hasEvents |= applyAllDeletes(deleteQueue);
    if (flushingDWPT != null) {
      hasEvents |= doFlush(flushingDWPT);
    } else {
      final DocumentsWriterPerThread nextPendingFlush = flushControl.nextPendingFlush();
      if (nextPendingFlush != null) {
        hasEvents |= doFlush(nextPendingFlush);
      }
    }

    return hasEvents;
  }

Moving back up to IndexWriter's updateDocuments: processEvents is then called to fire any pending events; its code is shown below.
IndexWriter::updateDocuments->processEvents

  private boolean processEvents(boolean triggerMerge, boolean forcePurge) throws IOException {
    return processEvents(eventQueue, triggerMerge, forcePurge);
  }

  private boolean processEvents(Queue<Event> queue, boolean triggerMerge, boolean forcePurge) throws IOException {
    boolean processed = false;
    if (tragedy == null) {
      Event event;
      while((event = queue.poll()) != null)  {
        processed = true;
        event.process(this, triggerMerge, forcePurge);
      }
    }
    return processed;
  }

processEvents polls Event objects off the queue one at a time and invokes each one's process function.
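Event itself is a small callback interface nested inside IndexWriter, roughly of this shape (reconstructed from this version, lightly simplified):

  interface Event {
    // executed on whichever thread drains the event queue
    void process(IndexWriter writer, boolean triggerMerge, boolean clearBuffers) throws IOException;
  }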
