从Lucene 2.3版本开始,Lucene提供了一个热备策略,就是SnapshotDeletionPolicy,这样就能在不关闭IndexWriter的情况下,对程序最近一次索引修改提交操作时的文件引用进行备份,从而能建立一个连续的索引备份镜像。那么你也许会有疑问,在备份期间,索引出现变化怎么办呢?
这就是SnapshotDeletionPolicy的牛逼之处,在使用SnapshotDeletionPolicy.snapshot()获取快照之后,索引更新提交时刻的所有文件引用都不会被IndexWriter删除,只要IndexWriter并未关闭,即使IndexWriter在进行更新、优化操作等也不会删除这些文件。如果说索引拷贝过程耗时较长也不会出现问题,因为被拷贝的文件时索引快照,在快照的有效期,其引用的文件会一直存在于磁盘上。
所以在备份期间,索引会比通常情况下占用更大的磁盘空间,当索引备份完成后,可以调用SnapshotDeletionPolicy.release (IndexCommit commit) 释放指定的某次提交,以便让IndexWriter删除这些已被关闭或下次将要更新的文件。
需要注意的是,Lucene对索引文件的写入操作是一次性完成的。这意味着你可以简单通过文件名比对来完成对索引的增量备份备份,你不必查看每个文件的内容,也不必查看该文件上次被修改的时间戳,因为一旦程序从快照中完成文件写入和引用操作,这些文件就不会改变了。
segments.gen文件在每次程序提交索引更新时都会被重写,因此备份模块必须要经常备份该文件,但是在Lucene 6.0中注意segments.gen已经从Lucene索引文件格式中移除,所以无需单独考虑segments.gen的备份策略了。在备份期间,write.lock文件不用拷贝。
SnapshotDeletionPolicy类有两个使用限制
索引热备份解决方案
import org.apache.commons.io.FileUtils; import org.apache.lucene.analysis.standard.StandardAnalyzer; import org.apache.lucene.document.*; import org.apache.lucene.index.*; import org.apache.lucene.search.TermQuery; import org.apache.lucene.store.FSDirectory; import java.io.File; import java.io.IOException; import java.nio.file.Paths; import java.util.Collection; import static org.apache.lucene.document.Field.Store.YES; /** * 测试Lucene索引热备 */ public class TestIndexBackupRecovery { public static void main(String[] args) throws IOException, InterruptedException { String f = "D:/index_test"; String d = "D:/index_back"; IndexWriterConfig indexWriterConfig = new IndexWriterConfig(new StandardAnalyzer()); indexWriterConfig.setIndexDeletionPolicy(new SnapshotDeletionPolicy(new KeepOnlyLastCommitDeletionPolicy())); IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get(f)), indexWriterConfig); Document document = new Document(); document.add(new StringField("ID", "111", YES)); document.add(new IntPoint("age", 111)); document.add(new StoredField("age", 111)); writer.addDocument(document); writer.commit(); document = new Document(); document.add(new StringField("ID", "222", Field.Store.YES)); document.add(new IntPoint("age", 333)); document.add(new StoredField("age", 333)); writer.addDocument(document); document.add(new StringField("ID", "333", Field.Store.YES)); document.add(new IntPoint("age", 555)); document.add(new StoredField("age", 555)); for (int i = 0; i 1000; i++) { document = new Document(); document.add(new StringField("ID", "333", YES)); document.add(new IntPoint("age", 1000000 + i)); document.add(new StoredField("age", 1000000 + i)); document.add(new StringField("desc", "ABCDEFG" + i, YES)); writer.addDocument(document); } writer.deleteDocuments(new TermQuery(new Term("ID", "333"))); writer.commit(); backupIndex(writer, f, d); writer.close(); } public static void backupIndex(IndexWriter indexWriter, String indexDir, String backupIndexDir) throws IOException { IndexWriterConfig config = (IndexWriterConfig) indexWriter.getConfig(); SnapshotDeletionPolicy snapshotDeletionPolicy = (SnapshotDeletionPolicy) config.getIndexDeletionPolicy(); IndexCommit snapshot = snapshotDeletionPolicy.snapshot(); //设置索引提交点,默认是null,会打开最后一次提交的索引点 config.setIndexCommit(snapshot); Collection fileNames = snapshot.getFileNames(); File[] dest = new File(backupIndexDir).listFiles(); String sourceFileName; String destFileName; if (dest != null && dest.length > 0) { //先删除备份文件中的在此次快照中已经不存在的文件 for (File file : dest) { boolean flag = true; //包括文件扩展名 destFileName = file.getName(); for (String fileName : fileNames) { sourceFileName = fileName; if (sourceFileName.equals(destFileName)) { flag = false; break;//跳出内层for循环 } } if (flag) { file.delete();//删除 } } //然后开始备份快照中新生成的文件 for (String fileName : fileNames) { boolean flag = true; sourceFileName = fileName; for (File file : dest) { destFileName = file.getName(); //备份中已经存在无需复制,因为Lucene索引是一次写入的,所以只要文件名相同不要要hash检查就可以认为它们的数据是一样的 if (destFileName.equals(sourceFileName)) { flag = false; break; } } if (flag) { File from = new File(indexDir + File.separator + sourceFileName);//源文件 File to = new File(backupIndexDir + File.separator + sourceFileName);//目的文件 FileUtils.copyFile(from, to); } } } else { //备份不存在,直接创建 for (String fileName : fileNames) { File from = new File(indexDir + File.separator + fileName);//源文件 File to = new File(backupIndexDir + File.separator + fileName);//目的文件 FileUtils.copyFile(from, to); } } snapshotDeletionPolicy.release(snapshot); //删除已经不再被引用的索引提交记录 indexWriter.deleteUnusedFiles(); } }
首先索引里都存了些什么呢?一个索引包含一个documents的序列,一个document是一个fields的序列,一个field是一个有名的terms序列,一个term是一个比特序列。
根据 Summary of File Extensions的说明,目前Lucene 6.0中存在的索引格式如下
Name | Extension | Brief Description |
---|---|---|
Segments File | segments_N | Stores information about a commit point |
Lock File | write.lock | The Write lock prevents multiple IndexWriters from writing to the same file |
Segment Info | .si | Stores metadata about a segment |
Compound File | .cfs, .cfe | An optional “virtual” file consisting of all the other index files for systems that frequently run out of file handles |
Fields | .fnm | Stores information about the fields |
Field Index | .fdx | Contains pointers to field data |
Field Data | .fdt | The stored fields for documents |
Term Dictionary | .tim | The term dictionary, stores term info |
Term Index | .tip | The index into the Term Dictionary |
Frequencies | .doc | Contains the list of docs which contain each term along with frequency |
Positions | .pos | Stores position information about where a term occurs in the index |
Payloads | .pay | Stores additional per-position metadata information such as character offsets and user payloads |
Norms | .nvd, .nvm | Encodes length and boost factors for docs and fields |
Per-Document Values | .dvd, .dvm | Encodes additional scoring factors or other per-document information |
Term Vector Index | .tvx | Stores offset into the document data file |
Term Vector Documents | .tvd | Contains information about each document that has term vectors |
Term Vector Fields | .tvf | The field level info about term vectors |
Live Documents | .liv | Info about what files are live |
Point values | .dii, .dim | Holds indexed points, if any |
在Lucene索引结构中,既保存了正向信息,也保存了反向信息。
正向信息存储在:段(segments_N)->field(.fnm/.fdx/.fdt)->term(.tvx/.tvd/.tvf)
反向信息存储在:词典(.tim)->倒排表(.doc/.pos)
如果想要查看中文,可以参考 这里。
恢复索引步骤如下
通常Lucene能够很好地避免大多数常见错误,如果程序遇到磁盘空间已满或者OutOfMemoryException异常,那么它只会丢失内存缓冲中的文档,已经编入索引的文档将会完好保存下来,并且索引也会保持原样。这个结果对于以下情况同样适用:如出现JVM崩溃,或者程序碰到无法控制的异常,或者程序进程被终止,或者操作系统崩溃,或者计算机突然断电等。
/** * @param source 索引源 * @param dest 索引目标 * @param indexWriterConfig 配置相关 */public static void recoveryIndex(String source, String dest, IndexWriterConfig indexWriterConfig) { IndexWriter indexWriter = null; try { indexWriter = new IndexWriter(FSDirectory.open(Paths.get(dest)), indexWriterConfig); } catch (IOException e) { log.error("", e); } finally { //说明IndexWriter正常打开了,无需恢复 if (indexWriter != null && indexWriter.isOpen()) { try { indexWriter.close(); } catch (IOException e) { log.error("", e); } } else { //说明IndexWriter已经无法打开,使用备份恢复索引 //此处简单操作,先清空损坏的索引文件目录,如果索引特别大,可以比对每个文件,不必全部删除 try { FileUtils.deleteDirectory(new File(dest)); FileUtils.copyDirectory(new File(source), new File(dest)); } catch (IOException e) { log.error("", e); //使用备份恢复出错,那么就使用最后一招修复索引 log.info("Check index {} now!", dest); try { IndexUtils.checkIndex(dest); } catch (IOException | InterruptedException e1) { log.error("Check index error!", e1); } } } } }
当其它所有方法都无法解决索引损坏问题时,你的最后一个选项就是使用CheckIndex工具了。该工具除了能汇报索引细节状况以外,还能完成修复索引的工作。该工具会强制删除索引中出现问题的段,需要注意的是,该操作还会全部删除这些段包含的文档,该工具的使用目标应主要着眼于能够在紧急状况下让搜索程序再次运行起来,一旦我们进行了索引备份,并且备份完好,应优先使用恢复索引,而不是修复索引。
/** * CheckIndex会检查索引中的每个字节,所以当索引比较大时,此操作会比较耗时 * * @throws IOException * @throws InterruptedException */ public void checkIndex(String indexFilePath) throws IOException, InterruptedException { CheckIndex checkIndex = new CheckIndex(FSDirectory.open(Paths.get(indexFilePath))); checkIndex.setInfoStream(System.out); CheckIndex.Status status = checkIndex.checkIndex(); if (status.clean) { System.out.println("Check Index successfully!"); } else { //产生索引中的某个文件之后再次测试 System.out.println("Starting repair index files..."); //该方法会向索引中写入一个新的segments文件,但是并不会删除不被引用的文件,除非当你再次打开IndexWriter才会移除不被引用的文件 //该方法会移除所有存在错误段中的Document索引文件 checkIndex.exorciseIndex(status); checkIndex.close(); //测试修复完毕之后索引是否能够打开 IndexWriter indexWriter = new IndexWriter(FSDirectory.open(Paths.get(indexFilePath)), new IndexWriterConfig(new StandardAnalyzer())); System.out.println(indexWriter.isOpen()); indexWriter.close(); } }
如果索引完好,输出如下信息:
Segments file=segments_2 numSegments=2 version=6.0.0 id=2jlug2dgsc4tmgkf5rck1xgm4 1 of 2: name=_0 maxDoc=1 version=6.0.0 id=2jlug2dgsc4tmgkf5rck1xgm1 codec=Lucene60 compound=true numFiles=3 size (MB)=0.002 diagnostics = {java.runtime.version=1.8.0_45-b15, java.vendor=Oracle Corporation, java.version=1.8.0_45, java.vm.version=25.45-b02, lucene.version=6.0.0, os=Windows 7, os.arch=amd64, os.version=6.1, source=flush, timestamp=1464159740092} no deletions test: open reader………OK [took 0.051 sec] test: check integrity…..OK [took 0.000 sec] test: check live docs…..OK [took 0.000 sec] test: field infos………OK [2 fields] [took 0.000 sec] test: field norms………OK [0 fields] [took 0.000 sec] test: terms, freq, prox…OK [1 terms; 1 terms/docs pairs; 0 tokens] [took 0.007 sec] test: stored fields…….OK [2 total field count; avg 2.0 fields per doc] [took 0.008 sec] test: term vectors……..OK [0 total term vector count; avg 0.0 term/freq vector fields per doc] [took 0.000 sec] test: docvalues………..OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0 SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET] [took 0.000 sec] test: points…………..OK [1 fields, 1 points] [took 0.001 sec] 2 of 2: name=_1 maxDoc=1001 version=6.0.0 id=2jlug2dgsc4tmgkf5rck1xgm3 codec=Lucene60 compound=true numFiles=4 size (MB)=0.023 diagnostics = {java.runtime.version=1.8.0_45-b15, java.vendor=Oracle Corporation, java.version=1.8.0_45, java.vm.version=25.45-b02, lucene.version=6.0.0, os=Windows 7, os.arch=amd64, os.version=6.1, source=flush, timestamp=1464159740329} has deletions [delGen=1] test: open reader………OK [took 0.003 sec] test: check integrity…..OK [took 0.000 sec] test: check live docs…..OK [1000 deleted docs] [took 0.000 sec] test: field infos………OK [3 fields] [took 0.000 sec] test: field norms………OK [0 fields] [took 0.000 sec] test: terms, freq, prox…OK [1 terms; 1 terms/docs pairs; 0 tokens] [took 0.008 sec] test: stored fields…….OK [2 total field count; avg 2.0 fields per doc] [took 0.012 sec] test: term vectors……..OK [0 total term vector count; avg 0.0 term/freq vector fields per doc] [took 0.000 sec] test: docvalues………..OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0 SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET] [took 0.000 sec] test: points…………..OK [1 fields, 1001 points] [took 0.001 sec] No problems were detected with this index. Took 0.159 sec total. Check Index successfully!
在破坏索引之后(删除了一个cfe文件),再次运行输出
Segments file=segments_2 numSegments=2 version=6.0.0 id=2jlug2dgsc4tmgkf5rck1xgm4 1 of 2: name=_0 maxDoc=1 version=6.0.0 id=2jlug2dgsc4tmgkf5rck1xgm1 codec=Lucene60 compound=true numFiles=3 FAILED WARNING: exorciseIndex() would remove reference to this segment; full exception: java.nio.file.NoSuchFileException: D:index_test_0.cfe at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:79) … … at java.lang.reflect.Method.invoke(Method.java:497) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144) 2 of 2: name=_1 maxDoc=1001 version=6.0.0 id=2jlug2dgsc4tmgkf5rck1xgm3 codec=Lucene60 compound=true numFiles=4 size (MB)=0.023 diagnostics = {java.runtime.version=1.8.0_45-b15, java.vendor=Oracle Corporation, java.version=1.8.0_45, java.vm.version=25.45-b02, lucene.version=6.0.0, os=Windows 7, os.arch=amd64, os.version=6.1, source=flush, timestamp=1464159740329} has deletions [delGen=1] test: open reader………OK [took 0.059 sec] test: check integrity…..OK [took 0.000 sec] test: check live docs…..OK [1000 deleted docs] [took 0.001 sec] test: field infos………OK [3 fields] [took 0.000 sec] test: field norms………OK [0 fields] [took 0.000 sec] test: terms, freq, prox…OK [1 terms; 1 terms/docs pairs; 0 tokens] [took 0.013 sec] test: stored fields…….OK [2 total field count; avg 2.0 fields per doc] [took 0.016 sec] test: term vectors……..OK [0 total term vector count; avg 0.0 term/freq vector fields per doc] [took 0.000 sec] test: docvalues………..OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0 SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET] [took 0.000 sec] test: points…………..OK [1 fields, 1001 points] [took 0.002 sec] WARNING: 1 broken segments (containing 1 documents) detected Took 0.165 sec total. Starting repair index files… true
Lucene 6.0 实战:索引热备份及恢复,首发于 文章 - 伯乐在线。