Lucene Source Code Analysis - 4

Lucene source code analysis: storing the inverted index

Following the analysis in Part 3 of this series (《lucene源码分析—3》), the inverted index is written by the invert function invoked from DefaultIndexingChain's processField function. invert is defined in PerField; the call chain and code are as follows:
IndexWriter::updateDocuments->DocumentsWriter::updateDocuments->DefaultIndexingChain::processDocument->processField->PerField::invert

    public void invert(IndexableField field, boolean first) throws IOException, AbortingException {
      if (first) {
        invertState.reset();
      }

      IndexableFieldType fieldType = field.fieldType();

      IndexOptions indexOptions = fieldType.indexOptions();
      fieldInfo.setIndexOptions(indexOptions);

      if (fieldType.omitNorms()) {
        fieldInfo.setOmitsNorms();
      }

      final boolean analyzed = fieldType.tokenized() && docState.analyzer != null;
      final boolean checkOffsets = indexOptions == IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS;

      boolean succeededInProcessingField = false;
      try (TokenStream stream = tokenStream = field.tokenStream(docState.analyzer, tokenStream)) {
        stream.reset();
        invertState.setAttributeSource(stream);
        termsHashPerField.start(field, first);

        while (stream.incrementToken()) {
          int posIncr = invertState.posIncrAttribute.getPositionIncrement();
          invertState.position += posIncr;
          invertState.lastPosition = invertState.position;
          if (posIncr == 0) {
            invertState.numOverlap++;
          }

          if (checkOffsets) {
            int startOffset = invertState.offset + invertState.offsetAttribute.startOffset();
            int endOffset = invertState.offset + invertState.offsetAttribute.endOffset();
            // (simplified: the original also validates that offsets never move
            // backwards before recording them)
            invertState.lastStartOffset = startOffset;
          }

          invertState.length++;
          try {
            termsHashPerField.add();
          } catch (MaxBytesLengthExceededException e) {
            // (simplified: the original skips terms whose bytes exceed the
            // maximum term length, logging a message instead of failing)
          } catch (Throwable th) {
            // (simplified: the original rethrows this as an AbortingException)
          }
        }
        stream.end();
        invertState.position += invertState.posIncrAttribute.getPositionIncrement();
        invertState.offset += invertState.offsetAttribute.endOffset();
        succeededInProcessingField = true;
      }

      if (analyzed) {
        invertState.position += docState.analyzer.getPositionIncrementGap(fieldInfo.name);
        invertState.offset += docState.analyzer.getOffsetGap(fieldInfo.name);
      }

      invertState.boost *= field.boost();
    }

Assume the Field passed in is a TextField. The boolean first indicates whether this is the first Field with this name in the current document. invertState holds the state of the current inversion pass; for the first Field it is initialized through its reset function. invert then obtains a TokenStream via the tokenStream function; the TokenStream enumerates the tokens the analyzer produces from the field value. tokenStream is defined in Field as follows:
PerField::invert->Field::tokenStream

  public TokenStream tokenStream(Analyzer analyzer, TokenStream reuse) {
    if (fieldType().indexOptions() == IndexOptions.NONE) {
      return null;
    }

    ...

    if (tokenStream != null) {
      return tokenStream;
    } else if (readerValue() != null) {
      return analyzer.tokenStream(name(), readerValue());
    } else if (stringValue() != null) {
      return analyzer.tokenStream(name(), stringValue());
    }
    // (simplified: the full method handles more value types and ends by
    // throwing an IllegalArgumentException when no usable value is present)
  }

The tokenStream code above has been simplified. Assume the field value is a String and the Analyzer is a SimpleAnalyzer; the tokenStream function invoked next is the final method SimpleAnalyzer inherits from Analyzer:
PerField::invert->Field::tokenStream->SimpleAnalyzer::tokenStream

  public final TokenStream tokenStream(final String fieldName,
                                       final Reader reader) {
    TokenStreamComponents components = reuseStrategy.getReusableComponents(this, fieldName);
    final Reader r = initReader(fieldName, reader);
    if (components == null) {
      components = createComponents(fieldName);
      reuseStrategy.setReusableComponents(this, fieldName, components);
    }
    components.setReader(r);
    return components.getTokenStream();
  }

Analyzer's default reuseStrategy is GLOBAL_REUSE_STRATEGY, a ReuseStrategy that shares components across all fields; its getReusableComponents function looks like this:
PerField::invert->Field::tokenStream->SimpleAnalyzer::tokenStream->ReuseStrategy::getReusableComponents

    public TokenStreamComponents getReusableComponents(Analyzer analyzer, String fieldName) {
      return (TokenStreamComponents) getStoredValue(analyzer);
    }

    protected final Object getStoredValue(Analyzer analyzer) {
      return analyzer.storedValue.get();
    }

storedValue is a CloseableThreadLocal, so each thread sees its own cached value and access is thread-safe.
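
The caching pattern behind getStoredValue and setStoredValue can be sketched as follows; this is a rough illustration only, with a plain ThreadLocal standing in for Lucene's CloseableThreadLocal:

  // Sketch of the per-thread cache behind getStoredValue/setStoredValue;
  // the class name is invented for illustration.
  final class StoredValueSketch {
    private final ThreadLocal<Object> storedValue = new ThreadLocal<>();

    Object getStoredValue() {
      return storedValue.get();        // null the first time a thread asks
    }

    void setStoredValue(Object components) {
      storedValue.set(components);     // cached for this thread only
    }
  }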
Assume the TokenStreamComponents obtained through getReusableComponents is null. tokenStream therefore creates a ReusableStringReader wrapping the String field value, then builds a TokenStreamComponents through createComponents, which SimpleAnalyzer overrides as follows:
PerField::invert->Field::tokenStream->SimpleAnalyzer::tokenStream->SimpleAnalyzer::createComponents

  protected TokenStreamComponents createComponents(final String fieldName) {
    return new TokenStreamComponents(new LowerCaseTokenizer());
  }

createComponents creates a LowerCaseTokenizer and passes it to the TokenStreamComponents constructor.
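
For comparison, a custom analyzer plugs into the same extension point by overriding createComponents. A minimal sketch, assuming the Lucene 6.x analyzers-common classes WhitespaceTokenizer and LowerCaseFilter (the analyzer name is made up for illustration):

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.Tokenizer;
  import org.apache.lucene.analysis.core.LowerCaseFilter;
  import org.apache.lucene.analysis.core.WhitespaceTokenizer;

  // hypothetical analyzer: createComponents supplies the Tokenizer
  // plus an optional chain of TokenFilters
  public final class MyWhitespaceAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
      Tokenizer source = new WhitespaceTokenizer();
      // lowercase as a separate filter; LowerCaseTokenizer fuses both steps
      return new TokenStreamComponents(source, new LowerCaseFilter(source));
    }
  }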

Back in Analyzer's tokenStream function, the freshly created TokenStreamComponents is cached in the ReuseStrategy; finally, the Reader wrapping the field value is set on the components and the TokenStream is returned. As createComponents shows, what is actually returned here is the LowerCaseTokenizer.
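
The effect of this caching is easy to observe: on the same thread, a second tokenStream call on the same analyzer returns the very same tokenizer instance. A small sketch, assuming SimpleAnalyzer:

  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.core.SimpleAnalyzer;

  public class ReuseDemo {
    public static void main(String[] args) throws Exception {
      try (SimpleAnalyzer analyzer = new SimpleAnalyzer()) {
        TokenStream first = analyzer.tokenStream("f", "one");
        first.close();                        // a stream must be closed before reuse
        TokenStream second = analyzer.tokenStream("f", "two");
        System.out.println(first == second);  // true: components came from the cache
      }
    }
  }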

Returning to PerField's invert function, the TokenStream's reset function is called next; it initializes a few values and sets the tokenizer's member input to the Reader that wraps the field value.

invert then calls setAttributeSource to point invertState at the stream's attributes, and calls TermsHashPerField's start function for per-field initialization; this start function ultimately calls the start functions of FreqProxTermsWriterPerField and TermVectorsConsumerPerField. Next, invert loops over the TokenStream's incrementToken function; LowerCaseTokenizer's version is defined as follows
PerField::invert->LowerCaseTokenizer::incrementToken

  public final boolean incrementToken() throws IOException {
    clearAttributes();
    int length = 0;
    int start = -1;
    int end = -1;
    char[] buffer = termAtt.buffer();
    while (true) {
      if (bufferIndex >= dataLen) {
        offset += dataLen;
        charUtils.fill(ioBuffer, input);
        if (ioBuffer.getLength() == 0) {
          dataLen = 0;
          if (length > 0) {
            break;
          } else {
            finalOffset = correctOffset(offset);
            return false;
          }
        }
        dataLen = ioBuffer.getLength();
        bufferIndex = 0;
      }
      final int c = charUtils.codePointAt(ioBuffer.getBuffer(), bufferIndex, ioBuffer.getLength());
      final int charCount = Character.charCount(c);
      bufferIndex += charCount;

      if (isTokenChar(c)) {
        if (length == 0) {
          start = offset + bufferIndex - charCount;
          end = start;
        } else if (length >= buffer.length-1) {
          buffer = termAtt.resizeBuffer(2+length);
        }
        end += charCount;
        length += Character.toChars(normalize(c), buffer, length);
        if (length >= MAX_WORD_LEN)
          break;
      } else if (length > 0)
        break;
    }

    termAtt.setLength(length);
    offsetAtt.setOffset(correctOffset(start), finalOffset = correctOffset(end));
    return true;
  }

Although this code is long, only a few calls matter here. The member variable input is a Reader holding the field value, and charUtils is in fact a Java5CharacterUtils. First, Java5CharacterUtils' fill function reads input into ioBuffer, whose type is CharacterBuffer. Each code point c is then tested and transformed by isTokenChar and normalize: isTokenChar decides whether c is a letter, normalize lowercases it, and accepted characters are appended to buffer, i.e. termAtt.buffer().
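
Putting the pieces together, the same incrementToken loop that invert runs can be driven from user code. A minimal sketch, assuming a Lucene 6.x-era SimpleAnalyzer; it prints the lowercased letter runs the tokenizer emits:

  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.core.SimpleAnalyzer;
  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
  import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

  public class IncrementTokenDemo {
    public static void main(String[] args) throws Exception {
      try (SimpleAnalyzer analyzer = new SimpleAnalyzer();
           TokenStream stream = analyzer.tokenStream("body", "Hello Lucene-4 World")) {
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        OffsetAttribute offset = stream.addAttribute(OffsetAttribute.class);
        stream.reset();                       // the same reset() that invert calls
        while (stream.incrementToken()) {     // the same loop that invert runs
          System.out.println(term.toString()
              + " [" + offset.startOffset() + "," + offset.endOffset() + ")");
        }
        stream.end();
        // prints: hello [0,5), lucene [6,12), world [15,20) -- the digit and
        // '-' are not letters, so the tokenizer breaks tokens there
      }
    }
  }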

Back in PerField's invert function, after some bookkeeping on invertState it calls the add function of termsHashPerField, whose concrete type is FreqProxTermsWriterPerField; add is defined in the parent class TermsHashPerField,
PerField::invert->FreqProxTermsWriterPerField::add

  void add() throws IOException {
    // add the term's bytes to the hash; termID >= 0 means this is the first
    // time the term has been seen since the last flush
    int termID = bytesHash.add(termAtt.getBytesRef());
    if (termID >= 0) {
      bytesHash.byteStart(termID);
      if (numPostingInt + intPool.intUpto > IntBlockPool.INT_BLOCK_SIZE) {
        intPool.nextBuffer();
      }

      if (ByteBlockPool.BYTE_BLOCK_SIZE - bytePool.byteUpto < numPostingInt*ByteBlockPool.FIRST_LEVEL_SIZE) {
        bytePool.nextBuffer();
      }

      intUptos = intPool.buffer;
      intUptoStart = intPool.intUpto;
      intPool.intUpto += streamCount;

      postingsArray.intStarts[termID] = intUptoStart + intPool.intOffset;

      for(int i=0;i<streamCount;i++) {
        final int upto = bytePool.newSlice(ByteBlockPool.FIRST_LEVEL_SIZE);
        intUptos[intUptoStart+i] = upto + bytePool.byteOffset;
      }
      postingsArray.byteStarts[termID] = intUptos[intUptoStart];

      newTerm(termID);

    } else {
      // term already seen: recover the existing termID and its int offsets
      termID = (-termID)-1;
      int intStart = postingsArray.intStarts[termID];
      intUptos = intPool.buffers[intStart >> IntBlockPool.INT_BLOCK_SHIFT];
      intUptoStart = intStart & IntBlockPool.INT_BLOCK_MASK;
      addTerm(termID);
    }

    if (doNextCall) {
      nextPerField.add(postingsArray.textStarts[termID]);
    }
  }

In add, bytePool stores each term's frequency and position data, while intPool stores each term's offsets into bytePool; when space runs out, both allocate new buffers via nextBuffer. Tracing newTerm and addTerm further shows that the field's data is ultimately stored in freqProxPostingsArray, of type FreqProxPostingsArray, and in termVectorsPostingsArray, of type TermVectorsPostingsArray; these two are the final inverted index. The code beyond this point involves many in-memory storage details and data structures, so we stop here and will fill it in later.
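
The sign convention on termID comes from BytesRefHash: add returns a fresh id for an unseen term and -(id+1) when the term is already present. A minimal standalone example using the public org.apache.lucene.util.BytesRefHash class:

  import org.apache.lucene.util.BytesRef;
  import org.apache.lucene.util.BytesRefHash;

  public class BytesRefHashDemo {
    public static void main(String[] args) {
      BytesRefHash hash = new BytesRefHash();
      int first = hash.add(new BytesRef("lucene"));  // new term: id >= 0
      int again = hash.add(new BytesRef("lucene"));  // duplicate: -(id + 1)
      System.out.println(first);                     // 0
      System.out.println((-again) - 1);              // 0, the original termID
    }
  }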

To summarize invert: it pulls each processed token from the analyzer in turn; for SimpleAnalyzer these are maximal runs of letters, lowercased by LowerCaseTokenizer. Each token is then stored through the TermsHashPerField object's IntBlockPool and ByteBlockPool, and into the inverted-index arrays FreqProxPostingsArray and TermVectorsPostingsArray.
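
For reference, the whole path analyzed above is triggered by the ordinary indexing API. A hedged end-to-end sketch, assuming a Lucene 6.x-era API (RAMDirectory is used only to keep the example self-contained):

  import org.apache.lucene.analysis.core.SimpleAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.document.TextField;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.IndexWriterConfig;
  import org.apache.lucene.store.RAMDirectory;

  public class InvertDemo {
    public static void main(String[] args) throws Exception {
      try (RAMDirectory dir = new RAMDirectory();
           IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new SimpleAnalyzer()))) {
        Document doc = new Document();
        // TextField is tokenized, so indexing it runs the invert loop above
        doc.add(new TextField("body", "Hello Lucene World", Field.Store.NO));
        writer.addDocument(doc); // -> DefaultIndexingChain::processField -> PerField::invert
      }
    }
  }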
