1 Search example
A few documents were first written to a Lucene index. Each document has two fields, id and name, and both fields are stored and indexed.
{
"id": 0,
"name": "Stephen"
},{
"id": 1,
"name": "Draymond"
},{
"id": 2,
"name": "LeBron"
},{
"id": 3,
"name": "Kevin"
}
The following code searches the Lucene index for the document whose name is LeBron. A series of articles will analyze how this code works.
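import java.io.File;
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
public class IndexSearcherTest {
  public static void main(String[] args) throws IOException, ParseException {
    // FSDirectory.open() takes a File in the Lucene version analyzed here
    Directory directory = FSDirectory.open(new File("/lucene/index/path"));
    IndexReader indexReader = DirectoryReader.open(directory);
    IndexSearcher indexSearcher = new IndexSearcher(indexReader);
    QueryParser queryParser = new QueryParser("name", new StandardAnalyzer());
    Query query = queryParser.parse("LeBron");
    TopDocs topDocs = indexSearcher.search(query, 10);
    ScoreDoc[] scoreDocs = topDocs.scoreDocs;
    for (ScoreDoc scoreDoc : scoreDocs) {
      System.out.println("doc: " + scoreDoc.doc + ", score: " + scoreDoc.score);
      Document document = indexReader.document(scoreDoc.doc);
      if (null == document) {
        continue;
      }
      System.out.println("id :" + document.get("id") + ", name: " + document.get("name"));
    }
  }
}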
This article covers the part that opens the index path and the index directory.
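1 Opening the index files
Because the index is stored on the local disk, FSDirectory can open it and return a Directory object for the index path:
Directory directory = FSDirectory.open(new File("/lucene/index/path"));
① If the JRE is 64-bit and supports unmapping (the sun.misc.Cleaner class can be loaded and java.nio.DirectByteBuffer has a cleaner() method), an MMapDirectory is created.
② Otherwise, if the operating system is Windows (its name starts with "Windows"), a SimpleFSDirectory is created.
③ If neither condition holds, an NIOFSDirectory is created.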
public abstract class FSDirectory extends BaseDirectory {
  public static FSDirectory open(File path) throws IOException {
    return open(path, null);
  }
  public static FSDirectory open(File path, LockFactory lockFactory) throws IOException {
    if (Constants.JRE_IS_64BIT && MMapDirectory.UNMAP_SUPPORTED) {
      return new MMapDirectory(path, lockFactory);
    } else if (Constants.WINDOWS) {
      return new SimpleFSDirectory(path, lockFactory);
    } else {
      return new NIOFSDirectory(path, lockFactory);
    }
  }
}
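The branch order matters: the memory-mapped variant is preferred even on Windows when it is available. The following is a minimal plain-Java sketch of that decision, not Lucene code. The class name, the method names and the system-property checks are assumptions made for illustration, and the unmap check is passed in as a flag rather than detected.

```java
// Plain-Java sketch of the selection order; it does not use any Lucene class.
final class DirectoryChoiceSketch {

  // Mirrors the three branches of FSDirectory.open(), with each check passed in as a flag.
  static String choose(boolean jreIs64Bit, boolean unmapSupported, boolean isWindows) {
    if (jreIs64Bit && unmapSupported) {
      return "MMapDirectory";
    } else if (isWindows) {
      return "SimpleFSDirectory";
    } else {
      return "NIOFSDirectory";
    }
  }

  // Assumption: approximates the Windows check through the os.name prefix.
  static boolean isWindows() {
    return System.getProperty("os.name", "").startsWith("Windows");
  }

  // Assumption: a rough 64-bit check based on commonly available system properties.
  static boolean looksLike64Bit() {
    String model = System.getProperty("sun.arch.data.model");
    if (model != null) {
      return model.contains("64");
    }
    return System.getProperty("os.arch", "").contains("64");
  }

  public static void main(String[] args) {
    // unmapSupported is hard-coded to true here purely for illustration.
    System.out.println(choose(looksLike64Bit(), true, isWindows()));
  }
}
```

On a 64-bit JRE this prints MMapDirectory regardless of the operating system, because the first branch wins.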
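The constructors of all three Directory implementations delegate to the constructor of the parent class FSDirectory.
The FSDirectory constructor mainly initializes the lockFactory and directory fields.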
public abstract class FSDirectory extends BaseDirectory {
  protected FSDirectory(File path, LockFactory lockFactory) throws IOException {
    // new ctors use always NativeFSLockFactory as default:
    if (lockFactory == null) {
      lockFactory = new NativeFSLockFactory();
    }
    directory = path.getCanonicalFile();
    if (directory.exists() && !directory.isDirectory())
      throw new NoSuchDirectoryException("file '" + directory + "' exists but is not a directory");
    setLockFactory(lockFactory);
  }
}
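The path check in this constructor can be tried out with java.io.File alone. The sketch below is an illustration under assumptions: the class and method names are hypothetical, and an unchecked exception stands in for NoSuchDirectoryException. As in the constructor above, a path that does not exist yet is accepted; only an existing regular file is rejected.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helper that reproduces only the path validation, without Lucene.
final class IndexPathCheckSketch {

  // Resolves the canonical path and rejects a path that exists but is not a directory.
  static File resolveIndexDir(File path) {
    final File dir;
    try {
      dir = path.getCanonicalFile();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    if (dir.exists() && !dir.isDirectory()) {
      throw new IllegalArgumentException("file '" + dir + "' exists but is not a directory");
    }
    return dir;
  }

  public static void main(String[] args) {
    // The current working directory always passes the check.
    System.out.println(resolveIndexDir(new File(".")));
  }
}
```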
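2 Opening the index directory
2.1 Locating the segments file
Once the index path has been opened and a Directory obtained, DirectoryReader is used to open that index directory: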
IndexReader indexReader = DirectoryReader.open(directory);
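This step is fairly involved. It locates and opens the index's segments files (segments_N and segments.gen), reads from them which index files each segment has, and opens those files (.tip, .tim, .doc, .pos, .tvd, .tvx, .si, .nvd, .nvm).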
public abstract class DirectoryReader extends BaseCompositeReader<AtomicReader> {
  public static DirectoryReader open(final Directory directory) throws IOException {
    return StandardDirectoryReader.open(directory, null, DEFAULT_TERMS_INDEX_DIVISOR);
  }
}
The open() method of StandardDirectoryReader creates a SegmentInfos.FindSegmentsFile object that overrides doBody(), and then calls that object's run() method.
final class StandardDirectoryReader extends DirectoryReader {
  static DirectoryReader open(final Directory directory, final IndexCommit commit,
                              final int termInfosIndexDivisor) throws IOException {
    return (DirectoryReader) new SegmentInfos.FindSegmentsFile(directory) {
      @Override
      protected Object doBody(String segmentFileName) throws IOException {
        SegmentInfos sis = new SegmentInfos();
        sis.read(directory, segmentFileName);
        final SegmentReader[] readers = new SegmentReader[sis.size()];
        boolean success = false;
        try {
          for (int i = sis.size()-1; i >= 0; i--) {
            readers[i] = new SegmentReader(sis.info(i), termInfosIndexDivisor, IOContext.READ);
          }
          DirectoryReader reader = new StandardDirectoryReader(directory, readers, null, sis, termInfosIndexDivisor, false);
          success = true;
          return reader;
        } finally {
          if (success == false) {
            IOUtils.closeWhileHandlingException(readers);
          }
        }
      }
    }.run(commit);
  }
}
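The shape used here is a template method: run() decides which file name to try, and the anonymous subclass supplies doBody() to say what to do with that file. The sketch below shows the same shape in plain Java. It is heavily simplified and every name in it is hypothetical; in particular it walks a fixed list of candidate names instead of computing a generation as Lucene does.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the FindSegmentsFile pattern; no Lucene classes are used.
final class FindFileTemplateSketch {

  abstract static class FindFile<T> {
    private final List<String> candidates;

    FindFile(List<String> candidates) {
      this.candidates = candidates;
    }

    // Tries each candidate name in order and returns the first result doBody() produces.
    T run() {
      RuntimeException last = new IllegalStateException("no candidates");
      for (String name : candidates) {
        try {
          return doBody(name);
        } catch (RuntimeException e) {
          last = e; // remember the failure and try the next candidate
        }
      }
      throw last;
    }

    // Supplied by the caller, typically through an anonymous subclass.
    protected abstract T doBody(String fileName);
  }

  // Overrides doBody() inline, the way StandardDirectoryReader.open() does.
  static String openFirstReadable(List<String> candidates, final String readable) {
    return new FindFile<String>(candidates) {
      @Override
      protected String doBody(String fileName) {
        if (!fileName.equals(readable)) {
          throw new IllegalStateException("cannot read " + fileName);
        }
        return "reader(" + fileName + ")";
      }
    }.run();
  }

  public static void main(String[] args) {
    System.out.println(openFirstReadable(Arrays.asList("segments_2", "segments_1"), "segments_1"));
  }
}
```

Keeping the retry loop in the base class means that each caller only has to describe how to open one file.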
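The run() method of SegmentInfos.FindSegmentsFile locates the segments file:
① Among the files whose names start with segments, excluding segments.gen, take the largest generation suffix as genA.
② Read the segments.gen file. According to the Lucene index file format, its layout is:
| GenHeader | Generation | Generation | Footer |
|---|---|---|---|
Generation is a long that is written twice. If the two values are equal, that value is used as genB.
③ Compare genA and genB. The larger one becomes the final gen, and segments_[gen] is used as the segments file name.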
public final class SegmentInfos implements Cloneable, Iterable<SegmentCommitInfo> {
  public abstract static class FindSegmentsFile {
    // Abridged excerpt: elided parts are marked with "// ..."
    public Object run(IndexCommit commit) throws IOException {
      if (commit != null) {
        if (directory != commit.getDirectory())
          throw new IOException("the specified commit does not match the specified Directory");
        return doBody(commit.getSegmentsFileName());
      }
      String segmentFileName = null;
      long lastGen = -1;
      long gen = 0;
      int retryCount = 0;
      boolean useFirstMethod = true;
      while(true) {
        if (useFirstMethod) {
          // Method A: list the directory and take the largest segments_N generation
          String[] files = directory.listAll();
          long genA = -1;
          if (files != null) {
            genA = getLastCommitGeneration(files);
          }
          // Method B: read the generation recorded in segments.gen
          long genB = -1;
          ChecksumIndexInput genInput = null;
          try {
            genInput = directory.openChecksumInput(IndexFileNames.SEGMENTS_GEN, IOContext.READONCE);
          } catch (IOException e) {
            // ...
          }
          if (genInput != null) {
            try {
              int version = genInput.readInt();
              if (version == FORMAT_SEGMENTS_GEN_47 || version == FORMAT_SEGMENTS_GEN_CHECKSUM) {
                long gen0 = genInput.readLong();
                long gen1 = genInput.readLong();
                if (gen0 == gen1) {
                  // The file is consistent.
                  genB = gen0;
                }
              } else {
                throw new IndexFormatTooNewException(genInput, version, FORMAT_SEGMENTS_GEN_START, FORMAT_SEGMENTS_GEN_CURRENT);
              }
            } catch (IOException err2) {
              // ...
            }
          }
          // Pick the larger of the two generations
          gen = Math.max(genA, genB);
        }
        if (useFirstMethod && lastGen == gen && retryCount >= 2) {
          useFirstMethod = false;
        }
        // ...
        segmentFileName = IndexFileNames.fileNameFromGeneration(IndexFileNames.SEGMENTS, "", gen);
        try {
          Object v = doBody(segmentFileName);
          if (infoStream != null) {
            message("success on " + segmentFileName);
          }
          return v;
        } catch (IOException err) {
          // ...
        }
      }
    }
    protected abstract Object doBody(String segmentFileName) throws IOException;
  }
}
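The three steps for picking the generation can be sketched without Lucene. The example below rests on several assumptions: the class and method names are hypothetical, the header and footer of segments.gen are left out so that only the two copies of the generation are read, and the generation suffix is treated as a base-36 number (Character.MAX_RADIX), the same radix the readFieldInfos() excerpt uses for its suffix.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Plain-Java sketch of the generation lookup; it does not read a real index.
final class SegmentsGenSketch {

  // Step 1: largest generation among "segments" / "segments_<base36>" names, or -1 if none.
  static long lastCommitGeneration(String[] files) {
    long max = -1;
    for (String file : files) {
      if (file.startsWith("segments") && !file.equals("segments.gen")) {
        long gen = file.equals("segments")
            ? 0
            : Long.parseLong(file.substring("segments_".length()), Character.MAX_RADIX);
        max = Math.max(max, gen);
      }
    }
    return max;
  }

  // Step 2: the generation is stored twice and is trusted only if both copies agree.
  static long readGenB(byte[] twoLongs) {
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(twoLongs))) {
      long gen0 = in.readLong();
      long gen1 = in.readLong();
      return gen0 == gen1 ? gen0 : -1;
    } catch (IOException e) {
      return -1; // truncated or unreadable: fall back to the directory listing
    }
  }

  // Step 3: the larger generation wins and is rendered in base 36.
  static String segmentsFileName(long genA, long genB) {
    long gen = Math.max(genA, genB);
    if (gen == -1) {
      throw new IllegalStateException("no segments* file found");
    }
    return gen == 0 ? "segments" : "segments_" + Long.toString(gen, Character.MAX_RADIX);
  }

  // Test helper that encodes two longs the way readGenB() expects them.
  static byte[] twoLongs(long first, long second) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      out.writeLong(first);
      out.writeLong(second);
    } catch (IOException e) {
      throw new IllegalStateException(e);
    }
    return bytes.toByteArray();
  }

  public static void main(String[] args) {
    String[] files = {"_0.cfs", "segments_1", "segments_a", "segments.gen"};
    long genA = lastCommitGeneration(files);
    long genB = readGenB(twoLongs(3, 3));
    System.out.println(segmentsFileName(genA, genB));
  }
}
```

Here genA is 10 (the suffix a in base 36) and genB is 3, so the sketch selects segments_a.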
2.2 Opening the index files
The doBody() method of SegmentInfos.FindSegmentsFile reads the individual index files.
final class StandardDirectoryReader extends DirectoryReader {
  static DirectoryReader open(final Directory directory, final IndexCommit commit,
                              final int termInfosIndexDivisor) throws IOException {
    return (DirectoryReader) new SegmentInfos.FindSegmentsFile(directory) {
      @Override
      protected Object doBody(String segmentFileName) throws IOException {
        SegmentInfos sis = new SegmentInfos();
        sis.read(directory, segmentFileName);
        final SegmentReader[] readers = new SegmentReader[sis.size()];
        boolean success = false;
        try {
          for (int i = sis.size()-1; i >= 0; i--) {
            readers[i] = new SegmentReader(sis.info(i), termInfosIndexDivisor, IOContext.READ);
          }
          // This may throw IllegalArgumentException if there are too many docs, so
          // it must be inside try clause so we close readers in that case:
          DirectoryReader reader = new StandardDirectoryReader(directory, readers, null, sis, termInfosIndexDivisor, false);
          success = true;
          return reader;
        } finally {
          if (success == false) {
            IOUtils.closeWhileHandlingException(readers);
          }
        }
      }
    }.run(commit);
  }
}
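The success flag combined with the finally block gives doBody() all-or-nothing behavior: if opening any segment fails, the readers opened so far are closed before the exception propagates. A minimal sketch of that idiom follows. The Res class is a hypothetical stand-in for SegmentReader, and the extra list exists only so the outcome can be inspected.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of the "success flag + finally" idiom; no Lucene classes are used.
final class OpenAllOrNothingSketch {

  // Hypothetical resource that records whether it was closed.
  static final class Res implements AutoCloseable {
    final String name;
    boolean closed;

    Res(String name, boolean fail) {
      if (fail) {
        throw new IllegalStateException("cannot open " + name);
      }
      this.name = name;
    }

    @Override
    public void close() {
      closed = true;
    }
  }

  // Opens every name from last to first; if one open fails, closes those already opened.
  static Res[] openAll(String[] names, String failing, List<Res> opened) {
    Res[] readers = new Res[names.length];
    boolean success = false;
    try {
      for (int i = names.length - 1; i >= 0; i--) {
        readers[i] = new Res(names[i], names[i].equals(failing));
        opened.add(readers[i]);
      }
      success = true;
      return readers;
    } finally {
      if (!success) {
        for (Res r : readers) {
          if (r != null) {
            r.close();
          }
        }
      }
    }
  }

  public static void main(String[] args) {
    List<Res> opened = new ArrayList<>();
    try {
      openAll(new String[] {"_0", "_1", "_2"}, "_0", opened);
    } catch (IllegalStateException e) {
      for (Res r : opened) {
        System.out.println(r.name + " closed=" + r.closed);
      }
    }
  }
}
```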
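doBody() first creates a SegmentInfos object and then calls sis.read(directory, segmentFileName) to read the segments file.
The segments file format is as follows:
| Header | Version | NameCounter | SegCount | <per-segment info>^SegCount | CommitUserData | Footer |
|---|---|---|---|---|---|---|
It then iterates over every segment recorded in the segments file and calls the SegmentReader constructor to read that segment's index files.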
public final class SegmentReader extends AtomicReader implements Accountable {
  public SegmentReader(SegmentCommitInfo si, int termInfosIndexDivisor, IOContext context) throws IOException {
    this.si = si;
    // Read the field infos (through the .cfs compound file when the segment uses one)
    fieldInfos = readFieldInfos(si);
    // Open the readers for tip, tim, nvd, nvm, fdt, fdx, tvf, tvd, tvx
    core = new SegmentCoreReaders(this, si.info.dir, si, context, termInfosIndexDivisor);
    segDocValues = new SegmentDocValues();
    boolean success = false;
    final Codec codec = si.info.getCodec();
    try {
      if (si.hasDeletions()) {
        // Read the .del file
        liveDocs = codec.liveDocsFormat().readLiveDocs(directory(), si, IOContext.READONCE);
      } else {
        assert si.getDelCount() == 0;
        liveDocs = null;
      }
      numDocs = si.info.getDocCount() - si.getDelCount();
      if (fieldInfos.hasDocValues()) {
        initDocValuesProducers(codec);
      }
      success = true;
    } finally {
      if (!success) {
        doClose();
      }
    }
  }
}
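① readFieldInfos() reads the field infos. When the segment uses a compound file and has no separate field-infos generation, they are read through the .cfs file, a "virtual" file used to access the compound stream.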
public final class SegmentReader extends AtomicReader implements Accountable {
  static FieldInfos readFieldInfos(SegmentCommitInfo info) throws IOException {
    final Directory dir;
    final boolean closeDir;
    if (info.getFieldInfosGen() == -1 && info.info.getUseCompoundFile()) {
      // no fieldInfos gen and segment uses a compound file
      dir = new CompoundFileDirectory(info.info.dir,
          IndexFileNames.segmentFileName(info.info.name, "", IndexFileNames.COMPOUND_FILE_EXTENSION),
          IOContext.READONCE,
          false);
      closeDir = true;
    } else {
      // gen'd FIS are read outside CFS, or the segment doesn't use a compound file
      dir = info.info.dir;
      closeDir = false;
    }
    try {
      final String segmentSuffix = info.getFieldInfosGen() == -1 ? "" : Long.toString(info.getFieldInfosGen(), Character.MAX_RADIX);
      Codec codec = info.info.getCodec();
      FieldInfosFormat fisFormat = codec.fieldInfosFormat();
      return fisFormat.getFieldInfosReader().read(dir, info.info.name, segmentSuffix, IOContext.READONCE);
    } finally {
      if (closeDir) {
        dir.close();
      }
    }
  }
}
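File names in this method are assembled from a segment name, an optional suffix and an extension, and the suffix is the field-infos generation in base 36. The sketch below illustrates that naming scheme under assumptions: the helper names are hypothetical, the rule for joining the parts (underscore before the suffix, dot before the extension) reflects my understanding of IndexFileNames.segmentFileName(), and the fnm extension is used only as an example.

```java
// Plain-Java sketch of how a per-segment file name could be composed.
final class SegmentFileNameSketch {

  // Builds "<segment>[_<suffix>][.<ext>]", for example "_0.cfs" or "_0_1.fnm".
  static String segmentFileName(String segment, String suffix, String ext) {
    StringBuilder sb = new StringBuilder(segment);
    if (!suffix.isEmpty()) {
      sb.append('_').append(suffix);
    }
    if (!ext.isEmpty()) {
      sb.append('.').append(ext);
    }
    return sb.toString();
  }

  // A generation of -1 means "no separate generation" and maps to an empty suffix.
  static String suffixForGen(long fieldInfosGen) {
    return fieldInfosGen == -1 ? "" : Long.toString(fieldInfosGen, Character.MAX_RADIX);
  }

  public static void main(String[] args) {
    System.out.println(segmentFileName("_0", "", "cfs"));
    System.out.println(segmentFileName("_0", suffixForGen(1), "fnm"));
  }
}
```

Because the suffix is base 36, generation 36 is rendered as 10 rather than 36.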
② Constructing the SegmentCoreReaders object opens the tip, tim, nvd, nvm, fdt, fdx, tvf, tvd and tvx files.
final class SegmentCoreReaders implements Accountable {
  SegmentCoreReaders(SegmentReader owner, Directory dir, SegmentCommitInfo si, IOContext context, int termsIndexDivisor) throws IOException {
    if (termsIndexDivisor == 0) {
      throw new IllegalArgumentException("indexDivisor must be < 0 (don't load terms index) or greater than 0 (got 0)");
    }
    final Codec codec = si.info.getCodec();
    final Directory cfsDir; // confusing name: if (cfs) its the cfsdir, otherwise its the segment's directory.
    boolean success = false;
    try {
      if (si.info.getUseCompoundFile()) {
        // Open the .cfs compound file
        cfsDir = cfsReader = new CompoundFileDirectory(dir, IndexFileNames.segmentFileName(si.info.name, "", IndexFileNames.COMPOUND_FILE_EXTENSION), context, false);
      } else {
        cfsReader = null;
        cfsDir = dir;
      }
      final FieldInfos fieldInfos = owner.fieldInfos;
      this.termsIndexDivisor = termsIndexDivisor;
      final PostingsFormat format = codec.postingsFormat();
      final SegmentReadState segmentReadState = new SegmentReadState(cfsDir, si.info, fieldInfos, context, termsIndexDivisor);
      // Read the tip and tim files
      fields = format.fieldsProducer(segmentReadState);
      assert fields != null;
      if (fieldInfos.hasNorms()) {
        // Read the nvd and nvm files
        normsProducer = codec.normsFormat().normsProducer(segmentReadState);
        assert normsProducer != null;
      } else {
        normsProducer = null;
      }
      // Read the fdx and fdt files
      fieldsReaderOrig = si.info.getCodec().storedFieldsFormat().fieldsReader(cfsDir, si.info, fieldInfos, context);
      if (fieldInfos.hasVectors()) {
        // Read the tvf, tvd and tvx files
        termVectorsReaderOrig = si.info.getCodec().termVectorsFormat().vectorsReader(cfsDir, si.info, fieldInfos, context);
      } else {
        termVectorsReaderOrig = null;
      }
      success = true;
    } finally {
      if (!success) {
        decRef();
      }
    }
  }
}
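Not every reader is opened for every segment: the norms and term-vector readers depend on what the field infos report, and the compound file is opened only when the segment uses one. The sketch below summarizes those conditions as a small planning function. It is illustrative only; the class name, the method name and the labels are hypothetical, and the file extensions are the ones named in the comments above.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java summary of which reader groups the constructor above would open.
final class CoreReadersPlanSketch {

  // Returns the reader groups in the order the constructor opens them.
  static List<String> readersToOpen(boolean useCompoundFile, boolean hasNorms,
                                    boolean hasVectors, int termsIndexDivisor) {
    if (termsIndexDivisor == 0) {
      throw new IllegalArgumentException("indexDivisor must be < 0 or greater than 0 (got 0)");
    }
    List<String> plan = new ArrayList<>();
    if (useCompoundFile) {
      plan.add("compound file (.cfs)");
    }
    plan.add("postings (.tim, .tip)"); // always opened
    if (hasNorms) {
      plan.add("norms (.nvd, .nvm)");
    }
    plan.add("stored fields (.fdt, .fdx)"); // always opened
    if (hasVectors) {
      plan.add("term vectors (.tvd, .tvx, .tvf)");
    }
    return plan;
  }

  public static void main(String[] args) {
    for (String group : readersToOpen(true, true, false, 1)) {
      System.out.println(group);
    }
  }
}
```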
The .tim file format is as follows:
| Header | PostingsHeader | NodeBlock^NumBlocks | FieldSummary | DirOffset | Footer |
|---|---|---|---|---|---|
The .tip file format is as follows:
| Header | FSTIndex^NumFields | DirOffset | Footer |
|---|---|---|---|
public class BlockTreeTermsReader extends FieldsProducer {
  public BlockTreeTermsReader(Directory dir, FieldInfos fieldInfos, SegmentInfo info,
                              PostingsReaderBase postingsReader, IOContext ioContext,
                              String segmentSuffix, int indexDivisor)
      throws IOException {
    this.postingsReader = postingsReader;
    this.segment = info.name;
    // Open the terms dictionary (.tim)
    in = dir.openInput(IndexFileNames.segmentFileName(segment, segmentSuffix, BlockTreeTermsWriter.TERMS_EXTENSION),
        ioContext);
    boolean success = false;
    IndexInput indexIn = null;
    try {
      version = readHeader(in);
      if (indexDivisor != -1) {
        // Open the terms index (.tip)
        indexIn = dir.openInput(IndexFileNames.segmentFileName(segment, segmentSuffix, BlockTreeTermsWriter.TERMS_INDEX_EXTENSION),
            ioContext);
        int indexVersion = readIndexHeader(indexIn);
        if (indexVersion != version) {
          throw new CorruptIndexException("mixmatched version files: " + in + "=" + version + "," + indexIn + "=" + indexVersion);
        }
      }
      // verify
      if (indexIn != null && version >= BlockTreeTermsWriter.VERSION_CHECKSUM) {
        CodecUtil.checksumEntireFile(indexIn);
      }
      // Have PostingsReader init itself
      postingsReader.init(in);
      // ...
    } finally {
      // ...
    }
  }
}
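The constructor reads a header from each of the two files and refuses to continue if their versions differ. The sketch below shows that check on a simplified header layout. The layout (an int magic number, a codec name, an int version), the MAGIC constant, the codec names and all method names are assumptions for illustration; Lucene's real header encoding differs in detail.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Plain-Java sketch of a header/version consistency check between two files.
final class HeaderVersionCheckSketch {
  static final int MAGIC = 0x3fd76c17; // assumed magic number marking the start of a header

  // Writes a simplified header: magic, codec name, version.
  static byte[] header(String codec, int version) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      out.writeInt(MAGIC);
      out.writeUTF(codec);
      out.writeInt(version);
    } catch (IOException e) {
      throw new IllegalStateException(e);
    }
    return bytes.toByteArray();
  }

  // Validates magic and codec name, then returns the version stored in the header.
  static int readVersion(byte[] file, String expectedCodec) {
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(file))) {
      if (in.readInt() != MAGIC) {
        throw new IllegalStateException("not a codec header");
      }
      String codec = in.readUTF();
      if (!codec.equals(expectedCodec)) {
        throw new IllegalStateException("expected " + expectedCodec + " but found " + codec);
      }
      return in.readInt();
    } catch (IOException e) {
      throw new IllegalStateException("truncated header", e);
    }
  }

  // Fails when the terms dictionary and the terms index disagree on the version.
  static int checkSameVersion(byte[] termsFile, byte[] indexFile) {
    int version = readVersion(termsFile, "TERMS_DICT");
    int indexVersion = readVersion(indexFile, "TERMS_INDEX");
    if (indexVersion != version) {
      throw new IllegalStateException("mismatched versions: " + version + " vs " + indexVersion);
    }
    return version;
  }

  public static void main(String[] args) {
    System.out.println(checkSameVersion(header("TERMS_DICT", 2), header("TERMS_INDEX", 2)));
  }
}
```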
fdx file format
fdt file format
public final class Lucene40StoredFieldsReader extends StoredFieldsReader implements Cloneable, Closeable {
  public Lucene40StoredFieldsReader(Directory d, SegmentInfo si, FieldInfos fn, IOContext context) throws IOException {
    final String segment = si.name;
    boolean success = false;
    fieldInfos = fn;
    try {
      fieldsStream = d.openInput(IndexFileNames.segmentFileName(segment, "", FIELDS_EXTENSION), context);
      final String indexStreamFN = IndexFileNames.segmentFileName(segment, "", FIELDS_INDEX_EXTENSION);
      indexStream = d.openInput(indexStreamFN, context);
      CodecUtil.checkHeader(indexStream, CODEC_NAME_IDX, VERSION_START, VERSION_CURRENT);
      CodecUtil.checkHeader(fieldsStream, CODEC_NAME_DAT, VERSION_START, VERSION_CURRENT);
      assert HEADER_LENGTH_DAT == fieldsStream.getFilePointer();
      assert HEADER_LENGTH_IDX == indexStream.getFilePointer();
      final long indexSize = indexStream.length() - HEADER_LENGTH_IDX;
      this.size = (int) (indexSize >> 3);
      // Verify two sources of "maxDoc" agree:
      if (this.size != si.getDocCount()) {
        throw new CorruptIndexException("doc counts differ for segment " + segment + ": fieldsReader shows " + this.size + " but segmentInfo shows " + si.getDocCount());
      }
      numTotalDocs = (int) (indexSize >> 3);
      success = true;
    } finally {
      if (!success) {
        try {
          close();
        } catch (Throwable t) {} // ensure we throw our original exception
      }
    }
  }
}
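The expression indexSize >> 3 divides the index size by 8, which suggests that the index file holds one 8-byte entry per document after its header. Under that assumption, the document count and the position of any document's entry follow from simple arithmetic, as sketched below. The class and method names are hypothetical, and the header length of 17 bytes is an arbitrary example value.

```java
// Plain-Java arithmetic sketch for an index file with one 8-byte entry per document.
final class StoredFieldsIndexSketch {

  // Document count = (file length - header length) / 8, written as a shift like the original.
  static int docCount(long indexFileLength, long headerLength) {
    long indexSize = indexFileLength - headerLength;
    return (int) (indexSize >> 3);
  }

  // Position of the 8-byte entry that belongs to the given document.
  static long entryPosition(int docId, long headerLength) {
    return headerLength + docId * 8L;
  }

  // Mirrors the consistency check between the two sources of the document count.
  static void verify(int fromIndexFile, int fromSegmentInfo) {
    if (fromIndexFile != fromSegmentInfo) {
      throw new IllegalStateException("doc counts differ: index file shows " + fromIndexFile
          + " but segment info shows " + fromSegmentInfo);
    }
  }

  public static void main(String[] args) {
    long headerLength = 17; // example value only
    int docs = docCount(headerLength + 4 * 8, headerLength);
    verify(docs, 4);
    System.out.println(docs + " docs, entry for doc 3 at byte " + entryPosition(3, headerLength));
  }
}
```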
tvx file format
| Header |
|---|
tvd file format
| Header |
|---|
tvf file format
| Header |
|---|
③ liveDocs = codec.liveDocsFormat().readLiveDocs(directory(), si, IOContext.READONCE); reads the .del file.
The .del file format is as follows:
| Format | Header | ByteCount | BitCount | Bits |
|---|---|---|---|---|
public class Lucene40LiveDocsFormat extends LiveDocsFormat {
  public Bits readLiveDocs(Directory dir, SegmentCommitInfo info, IOContext context) throws IOException {
    String filename = IndexFileNames.fileNameFromGeneration(info.info.name, DELETES_EXTENSION, info.getDelGen());
    final BitVector liveDocs = new BitVector(dir, filename, context);
    if (liveDocs.length() != info.info.getDocCount()) {
      throw new CorruptIndexException("liveDocs.length()=" + liveDocs.length() + "info.docCount=" + info.info.getDocCount() + " (filename=" + filename + ")");
    }
    if (liveDocs.count() != info.info.getDocCount() - info.getDelCount()) {
      throw new CorruptIndexException("liveDocs.count()=" + liveDocs.count() + " info.docCount=" + info.info.getDocCount() + " info.getDelCount()=" + info.getDelCount() + " (filename=" + filename + ")");
    }
    return liveDocs;
  }
}
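The two checks in readLiveDocs() tie the bit vector to the segment's counters: its length must equal the document count, and the number of set bits must equal the document count minus the deletion count. The sketch below reproduces both checks with java.util.BitSet standing in for Lucene's BitVector. The names are hypothetical, and the logical length is passed separately because BitSet.length() reports the highest set bit rather than the declared size.

```java
import java.util.BitSet;

// Plain-Java sketch of the live-docs consistency checks; a set bit means "live".
final class LiveDocsCheckSketch {

  // Builds a bit set with all documents live except the given deleted ids.
  static BitSet liveDocs(int docCount, int... deletedDocIds) {
    BitSet live = new BitSet(docCount);
    live.set(0, docCount);
    for (int id : deletedDocIds) {
      live.clear(id);
    }
    return live;
  }

  // Mirrors the two checks: length against docCount, live count against docCount - delCount.
  static void check(BitSet live, int length, int docCount, int delCount) {
    if (length != docCount) {
      throw new IllegalStateException("length=" + length + " docCount=" + docCount);
    }
    if (live.cardinality() != docCount - delCount) {
      throw new IllegalStateException("liveCount=" + live.cardinality()
          + " docCount=" + docCount + " delCount=" + delCount);
    }
  }

  // The number of visible documents, as computed in the SegmentReader constructor.
  static int numDocs(int docCount, int delCount) {
    return docCount - delCount;
  }

  public static void main(String[] args) {
    BitSet live = liveDocs(4, 2); // four documents, document 2 deleted
    check(live, 4, 4, 1);
    System.out.println("numDocs=" + numDocs(4, 1) + ", doc 2 live=" + live.get(2));
  }
}
```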
3 Creating the IndexSearcher
After the index directory has been opened, the IndexSearcher object is created:
IndexSearcher indexSearcher = new IndexSearcher(indexReader);
Creating the IndexSearcher mainly initializes the searcher's reader context and reader.
public class IndexSearcher {
  public IndexSearcher(IndexReader r) {
    this(r, null);
  }
  public IndexSearcher(IndexReader r, ExecutorService executor) {
    this(r.getContext(), executor);
  }
  public IndexSearcher(IndexReaderContext context, ExecutorService executor) {
    assert context.isTopLevel: "IndexSearcher's ReaderContext must be topLevel for reader" + context.reader();
    reader = context.reader();
    this.executor = executor;
    this.readerContext = context;
    leafContexts = context.leaves();
    this.leafSlices = executor == null ? null : slices(leafContexts);
  }
}
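The last line of the constructor only builds slices when an executor is supplied, so a searcher without an executor has no slices at all. The sketch below illustrates that decision. Placing each leaf in its own slice is an assumption about what slices() does, since its body is not shown above, and the class name and generic signature are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Plain-Java sketch of the leafSlices decision; leaves are modeled as arbitrary objects.
final class LeafSlicesSketch {

  // No executor means no slices; otherwise each leaf gets its own slice (assumed policy).
  static <T> List<List<T>> slices(List<T> leaves, ExecutorService executor) {
    if (executor == null) {
      return null;
    }
    List<List<T>> slices = new ArrayList<>();
    for (T leaf : leaves) {
      List<T> slice = new ArrayList<>();
      slice.add(leaf);
      slices.add(slice);
    }
    return slices;
  }

  public static void main(String[] args) {
    ExecutorService executor = Executors.newFixedThreadPool(2);
    try {
      System.out.println(slices(Arrays.asList("_0", "_1"), executor));
      System.out.println(slices(Arrays.asList("_0", "_1"), null));
    } finally {
      executor.shutdown();
    }
  }
}
```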