As analyzed in 《lucene源码分析—3》, the inverted index is written by the invert function reached from the processField function of DefaultIndexingChain; invert is defined on the inner class PerField. The call chain and code are as follows,
IndexWriter::updateDocuments->DocumentsWriter::updateDocuments->DefaultIndexingChain::processDocument->processField->PerField::invert
public void invert(IndexableField field, boolean first) throws IOException, AbortingException {
  if (first) {
    invertState.reset();
  }
  IndexableFieldType fieldType = field.fieldType();
  IndexOptions indexOptions = fieldType.indexOptions();
  fieldInfo.setIndexOptions(indexOptions);
  if (fieldType.omitNorms()) {
    fieldInfo.setOmitsNorms();
  }
  final boolean analyzed = fieldType.tokenized() && docState.analyzer != null;
  final boolean checkOffsets = indexOptions == IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS;
  boolean succeededInProcessingField = false;
  try (TokenStream stream = tokenStream = field.tokenStream(docState.analyzer, tokenStream)) {
    stream.reset();
    invertState.setAttributeSource(stream);
    termsHashPerField.start(field, first);
    while (stream.incrementToken()) {
      int posIncr = invertState.posIncrAttribute.getPositionIncrement();
      invertState.position += posIncr;
      invertState.lastPosition = invertState.position;
      if (posIncr == 0) {
        invertState.numOverlap++;
      }
      if (checkOffsets) {
        // offset sanity checks elided in this simplified listing
        int startOffset = invertState.offset + invertState.offsetAttribute.startOffset();
        int endOffset = invertState.offset + invertState.offsetAttribute.endOffset();
        invertState.lastStartOffset = startOffset;
      }
      invertState.length++;
      try {
        termsHashPerField.add();
      } catch (MaxBytesLengthExceededException e) {
        // error handling elided in this simplified listing
      } catch (Throwable th) {
        // error handling elided in this simplified listing
      }
    }
    stream.end();
    invertState.position += invertState.posIncrAttribute.getPositionIncrement();
    invertState.offset += invertState.offsetAttribute.endOffset();
    succeededInProcessingField = true;
  }
  if (analyzed) {
    invertState.position += docState.analyzer.getPositionIncrementGap(fieldInfo.name);
    invertState.offset += docState.analyzer.getOffsetGap(fieldInfo.name);
  }
  invertState.boost *= field.boost();
}
Assume the IndexableField argument is a TextField. The boolean first indicates whether this is the first field in the document with this field name. invertState holds the state of the current indexing pass; for the first occurrence it is initialized via the reset function. invert then obtains a TokenStream, the token-by-token view produced by an analyzer, through the tokenStream function, which is defined in Field, as follows,
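For orientation, here is a minimal driver that reaches this code path. This is a sketch assuming Lucene 6.x with the analyzers-common module on the classpath; the class name InvertDemo and the field name content are made up for illustration:

import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.RAMDirectory;

public class InvertDemo {
  public static void main(String[] args) throws Exception {
    // SimpleAnalyzer splits on letters and lowercases, matching the
    // assumptions used throughout this walkthrough.
    IndexWriterConfig config = new IndexWriterConfig(new SimpleAnalyzer());
    try (IndexWriter writer = new IndexWriter(new RAMDirectory(), config)) {
      Document doc = new Document();
      // A tokenized TextField is what drives PerField.invert below.
      doc.add(new TextField("content", "Hello Lucene", Field.Store.NO));
      writer.addDocument(doc); // ends up in DefaultIndexingChain.processDocument
    }
  }
}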
PerField::invert->Field::tokenStream
public TokenStream tokenStream(Analyzer analyzer, TokenStream reuse) {
  if (fieldType().indexOptions() == IndexOptions.NONE) {
    return null;
  }
  ...
  if (tokenStream != null) {
    return tokenStream;
  } else if (readerValue() != null) {
    return analyzer.tokenStream(name(), readerValue());
  } else if (stringValue() != null) {
    return analyzer.tokenStream(name(), stringValue());
  }
  // remaining value types and the final error case elided
}
This tokenStream listing is simplified. Assume the field value is a String and the Analyzer is a SimpleAnalyzer; the tokenStream function called next is declared final in the base class Analyzer and inherited by SimpleAnalyzer,
PerField::invert->Field::tokenStream->SimpleAnalyzer::tokenStream
public final TokenStream tokenStream(final String fieldName,
                                     final Reader reader) {
  TokenStreamComponents components = reuseStrategy.getReusableComponents(this, fieldName);
  final Reader r = initReader(fieldName, reader);
  if (components == null) {
    components = createComponents(fieldName);
    reuseStrategy.setReusableComponents(this, fieldName, components);
  }
  components.setReader(r);
  return components.getTokenStream();
}
The default reuseStrategy of an Analyzer is GLOBAL_REUSE_STRATEGY, a ReuseStrategy that caches a single TokenStreamComponents per Analyzer; its getReusableComponents function is as follows,
PerField::invert->Field::tokenStream->SimpleAnalyzer::tokenStream->ReuseStrategy::getReusableComponents
public TokenStreamComponents getReusableComponents(Analyzer analyzer, String fieldName) {
  return (TokenStreamComponents) getStoredValue(analyzer);
}

protected final Object getStoredValue(Analyzer analyzer) {
  return analyzer.storedValue.get();
}
storedValue is a CloseableThreadLocal&lt;Object&gt;, so each thread sees its own cached value; this is what makes the components cache thread safe.
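A quick way to observe this per-thread reuse; a sketch assuming Lucene 6.x, where the default global reuse strategy hands back the cached tokenizer on the same thread (the class name ReuseDemo is illustrative):

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.SimpleAnalyzer;

public class ReuseDemo {
  public static void main(String[] args) throws Exception {
    SimpleAnalyzer analyzer = new SimpleAnalyzer();
    TokenStream first = analyzer.tokenStream("f", "one");
    first.close(); // the stream must be closed before the components can be reused
    TokenStream second = analyzer.tokenStream("f", "two");
    // Same analyzer, same thread: the cached components hand back the
    // very same tokenizer instance under the default reuse strategy.
    System.out.println(first == second); // expected: true
    second.close();
    analyzer.close();
  }
}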
Assume getReusableComponents returns null here. tokenStream then creates a ReusableStringReader to wrap the String field value and builds a fresh TokenStreamComponents via the createComponents function, which SimpleAnalyzer overrides as follows,
PerField::invert->Field::tokenStream->SimpleAnalyzer::tokenStream->SimpleAnalyzer::createComponents
protected TokenStreamComponents createComponents(final String fieldName) {
  return new TokenStreamComponents(new LowerCaseTokenizer());
}
createComponents instantiates a LowerCaseTokenizer and passes it to the TokenStreamComponents constructor.
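createComponents is the standard extension point for custom analyzers, typically pairing a Tokenizer with a chain of TokenFilters. A sketch of that pattern, assuming Lucene 6.x where LetterTokenizer and LowerCaseFilter live in org.apache.lucene.analysis.core (the class name LowercaseLetterAnalyzer is made up):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LetterTokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;

// Functionally close to SimpleAnalyzer, but written as an explicit
// Tokenizer + TokenFilter chain: split on letters, then lowercase.
public class LowercaseLetterAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new LetterTokenizer();
    // getTokenStream() returns the end of this chain, the LowerCaseFilter.
    return new TokenStreamComponents(source, new LowerCaseFilter(source));
  }
}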
Back in Analyzer's tokenStream function, the newly created TokenStreamComponents is cached in the ReuseStrategy; finally the Reader wrapping the field value is set on the components and the TokenStream is returned. As createComponents shows, what actually comes back here is the LowerCaseTokenizer.
Returning to PerField's invert function, the TokenStream's reset function is called next for initialization; it resets a few internal values and points the member variable input at the Reader that wraps the field value.
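The reset, incrementToken, end, close sequence that invert performs is the general TokenStream consumer contract. A standalone sketch of the same loop, assuming Lucene 6.x (the class name ConsumeDemo is hypothetical):

import java.io.IOException;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

public class ConsumeDemo {
  public static void main(String[] args) throws IOException {
    try (SimpleAnalyzer analyzer = new SimpleAnalyzer();
         TokenStream stream = analyzer.tokenStream("content", "Hello Lucene")) {
      CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
      OffsetAttribute offset = stream.addAttribute(OffsetAttribute.class);
      stream.reset();                   // mandatory before the first incrementToken
      while (stream.incrementToken()) { // advances to the next token
        // prints "hello [0,5)" then "lucene [6,12)"
        System.out.println(term + " [" + offset.startOffset() + "," + offset.endOffset() + ")");
      }
      stream.end();                     // records the final offset, as invert does
    }                                   // try-with-resources closes the stream
  }
}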
invert then passes the stream's attributes to invertState via setAttributeSource, and calls TermsHashPerField's start function for per-field setup; that start function in turn calls the start functions of FreqProxTermsWriterPerField and TermVectorsConsumerPerField. Further down, invert loops over the stream by calling the TokenStream's incrementToken function. LowerCaseTokenizer inherits incrementToken from its base class CharTokenizer; it is defined as follows
PerField::invert->LowerCaseTokenizer::incrementToken
public final boolean incrementToken() throws IOException {
  clearAttributes();
  int length = 0;
  int start = -1;
  int end = -1;
  char[] buffer = termAtt.buffer();
  while (true) {
    if (bufferIndex >= dataLen) {
      offset += dataLen;
      charUtils.fill(ioBuffer, input);
      if (ioBuffer.getLength() == 0) {
        dataLen = 0;
        if (length > 0) {
          break;
        } else {
          finalOffset = correctOffset(offset);
          return false;
        }
      }
      dataLen = ioBuffer.getLength();
      bufferIndex = 0;
    }
    final int c = charUtils.codePointAt(ioBuffer.getBuffer(), bufferIndex, ioBuffer.getLength());
    final int charCount = Character.charCount(c);
    bufferIndex += charCount;
    if (isTokenChar(c)) {
      if (length == 0) {
        start = offset + bufferIndex - charCount;
        end = start;
      } else if (length >= buffer.length-1) {
        buffer = termAtt.resizeBuffer(2+length);
      }
      end += charCount;
      length += Character.toChars(normalize(c), buffer, length);
      if (length >= MAX_WORD_LEN)
        break;
    } else if (length > 0)
      break;
  }
  termAtt.setLength(length);
  offsetAtt.setOffset(correctOffset(start), finalOffset = correctOffset(end));
  return true;
}
Although this code is long, only a few functions matter here. The member variable input, of type Reader, holds the field value; charUtils is in fact a Java5CharacterUtils. First, the fill function of Java5CharacterUtils reads input into ioBuffer, a CharacterBuffer. Each code point c is then screened by isTokenChar and transformed by normalize: in LowerCaseTokenizer, isTokenChar tests whether c is a letter, and normalize converts it to lowercase. Accepted characters are appended to buffer, which is termAtt.buffer().
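isTokenChar and normalize are the two hooks that CharTokenizer exposes to subclasses; LowerCaseTokenizer overrides them to accept letters and lowercase them. As a hedged sketch of how they shape tokenization, here is a hypothetical variant that also accepts digits, assuming Lucene 6.x where CharTokenizer lives in org.apache.lucene.analysis.util:

import org.apache.lucene.analysis.util.CharTokenizer;

// Like LowerCaseTokenizer, but treats digits as token characters too,
// so "Lucene6" stays one token instead of being split at the digit.
public class AlnumLowerCaseTokenizer extends CharTokenizer {
  @Override
  protected boolean isTokenChar(int c) {
    return Character.isLetterOrDigit(c); // LowerCaseTokenizer uses isLetter
  }

  @Override
  protected int normalize(int c) {
    return Character.toLowerCase(c);     // same lowercasing as LowerCaseTokenizer
  }
}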
Back in PerField's invert function, after a few updates to invertState, the code calls the add function of termsHashPerField, whose concrete type is FreqProxTermsWriterPerField; add is defined in its parent class TermsHashPerField,
PerField::invert->FreqProxTermsWriterPerField::add
void add() throws IOException {
  int termID = bytesHash.add(termAtt.getBytesRef());
  if (termID >= 0) {
    // first time this term is seen: allocate int/byte slices for it
    bytesHash.byteStart(termID);
    if (numPostingInt + intPool.intUpto > IntBlockPool.INT_BLOCK_SIZE) {
      intPool.nextBuffer();
    }
    if (ByteBlockPool.BYTE_BLOCK_SIZE - bytePool.byteUpto < numPostingInt*ByteBlockPool.FIRST_LEVEL_SIZE) {
      bytePool.nextBuffer();
    }
    intUptos = intPool.buffer;
    intUptoStart = intPool.intUpto;
    intPool.intUpto += streamCount;
    postingsArray.intStarts[termID] = intUptoStart + intPool.intOffset;
    for(int i=0;i<streamCount;i++) {
      final int upto = bytePool.newSlice(ByteBlockPool.FIRST_LEVEL_SIZE);
      intUptos[intUptoStart+i] = upto + bytePool.byteOffset;
    }
    postingsArray.byteStarts[termID] = intUptos[intUptoStart];
    newTerm(termID);
  } else {
    // term already seen: decode the real termID and extend its postings
    termID = (-termID)-1;
    int intStart = postingsArray.intStarts[termID];
    intUptos = intPool.buffers[intStart >> IntBlockPool.INT_BLOCK_SHIFT];
    intUptoStart = intStart & IntBlockPool.INT_BLOCK_MASK;
    addTerm(termID);
  }
  if (doNextCall) {
    nextPerField.add(postingsArray.textStarts[termID]);
  }
}
In add, bytePool stores each term's frequency and position data, while intPool stores each term's current write offsets into bytePool; when either pool runs out of space, its nextBuffer function allocates a new block. Tracing into newTerm and addTerm shows that the field data finally lands in freqProxPostingsArray, of type FreqProxPostingsArray, and termVectorsPostingsArray, of type TermVectorsPostingsArray; these two structures are the in-memory inverted index. The code below this level involves too many memory-layout details and data structures to cover here; I will fill them in when time permits.
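As a conceptual aid only, the logical content these pools and postings arrays accumulate can be pictured with plain Java collections. This toy sketch mirrors what newTerm/addTerm record for each (term, document, position) event; it bears no resemblance to Lucene's actual slice-based memory layout:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Logical view of a posting: which document, how often, at which positions.
class Posting {
  final int docId;
  int freq;
  final List<Integer> positions = new ArrayList<>();
  Posting(int docId) { this.docId = docId; }
}

public class ToyInvertedIndex {
  // term -> postings list; FreqProxPostingsArray encodes the same
  // information, but packed into int/byte pools instead of objects.
  private final Map<String, List<Posting>> index = new HashMap<>();

  // Roughly what newTerm/addTerm achieve for one (term, doc, position) event.
  public void add(String term, int docId, int position) {
    List<Posting> postings = index.computeIfAbsent(term, t -> new ArrayList<>());
    Posting last = postings.isEmpty() ? null : postings.get(postings.size() - 1);
    if (last == null || last.docId != docId) { // first time term seen in this doc
      last = new Posting(docId);
      postings.add(last);
    }
    last.freq++;
    last.positions.add(position);
  }
}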
To summarize invert: it pulls the processed tokens out of the analyzer one by one; for SimpleAnalyzer each token is a maximal run of letters converted to lowercase. Each token is then stored through the TermsHashPerField object into the IntBlockPool and ByteBlockPool, and ultimately into the inverted-index structures FreqProxPostingsArray and TermVectorsPostingsArray.