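Setting everything else aside, let's first look at the code snippets below, all of which achieve "real-time" search.
Notes:
1. I am currently using Lucene 3.5.
2. To check whether the search is "real-time", I use a simple test: whether numDocs changes.
3. Please be clear about what "real-time" means here, and distinguish it from "near real-time".
Method 1: indexWriter commits every time, and indexReader calls open(dir) every time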
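// Assumes w (an IndexWriter) and dir (its Directory) are fields of the enclosing class.
public void nrtOpenDir() {
    try {
        Document doc = new Document();
        Field f = new Field("f", "test", Store.YES, Index.ANALYZED);
        doc.add(f);
        for (int i = 0; i < 20; i++) {
            w.addDocument(doc);
            w.commit();                             // commit after every single document
            IndexReader r = IndexReader.open(dir);  // open a brand-new reader on every pass
            System.out.println(r.numDocs());
            r.close();                              // release the reader once we are done with it
        }
    } catch (CorruptIndexException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
}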
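With small test data sets, the approach above does give a "real-time" search effect. It has two problems:
1. With large data volumes, indexWriter.commit() is very time-consuming! (This may be a costly operation, so you should test the cost in your application and do it only when really necessary.)
2. With large data volumes, IndexReader.open() is very time-consuming!
Therefore, do not use this approach in a real project! (Take note!)
Method 2: indexWriter commits every time, and indexReader calls reopen() every time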
/**
 * reopen -> openIfChanged
 */
public void nrtReopen() {
    try {
        Document doc = new Document();
        Field f = new Field("f", "test", Store.YES, Index.ANALYZED);
        doc.add(f);
        IndexReader r = IndexReader.open(dir);
        for (int i = 0; i < 20; i++) {
            w.addDocument(doc);
            w.commit();
            // r = r.reopen();  // replaced by openIfChanged in 3.5
            IndexReader nr = IndexReader.openIfChanged(r);  // returns null if nothing changed
            if (nr != null) {
                r.close();  // the old reader is no longer needed
                r = nr;
            }
            System.out.println(r.numDocs());
        }
        r.close();
    } catch (CorruptIndexException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
In 3.5, openIfChanged(r) replaces the reopen() method.
open(dir) is indeed a time-consuming operation. openIfChanged takes less time than open because it only loads the newly added part of the index. (Opening an IndexReader is an expensive operation. This method can be used to refresh an existing IndexReader to reduce these costs. This method tries to only load segments that have changed or were created after the IndexReader was (re)opened.)
However, this method still requires a commit, so it is not recommended either!
Method 3: indexWriter does not commit every time, and indexReader calls open(indexWriter) every time
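// Assumes w (an IndexWriter) is a field of the enclosing class.
public void nrtNRT() {
    try {
        Document doc = new Document();
        Field f = new Field("f", "test", Store.YES, Index.ANALYZED);
        doc.add(f);
        for (int i = 0; i < 20; i++) {
            w.addDocument(doc);                          // note: no commit here
            // IndexReader r = w.getReader();            // older form of the same call
            IndexReader r = IndexReader.open(w, false);  // reader obtained from the writer
            System.out.println(r.numDocs());
            r.close();                                   // release the reader once we are done with it
        }
    } catch (CorruptIndexException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
}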
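Method 3 has no commit operation. So how do IndexReader.open(w, false) and IndexReader.openIfChanged(r) differ in efficiency?
Let's run a simple experiment!
The experiment compares only the efficiency of IndexReader.open(w, false) and IndexReader.openIfChanged(r). The main code is as follows:
The openIfChanged(r) approach (approach a):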
long bt = System.currentTimeMillis();
IndexReader r = IndexReader.open(dir);
for (int i = 0; i < readDocCount; i++) {
    IndexReader nr = IndexReader.openIfChanged(r);
    if (nr != null) {
        r = nr;
    }
}
long et = System.currentTimeMillis();
System.out.println("reopen:" + (et - bt) + "ms");
The open(w) approach (approach b):
long bt = System.currentTimeMillis();
IndexReader r = null;
for (int i = 0; i < readDocCount; i++) {
    r = IndexReader.open(w, false);
}
long et = System.currentTimeMillis();
System.out.println("nrt:" + (et - bt) + "ms");
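I built a simple index with a single field and then added 100,000 documents. The test showed that approach a is indeed faster than approach b, and that the time taken by both approaches grows as more documents are added to the index.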
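Timings taken with a single pass and System.currentTimeMillis() can be skewed by JIT warm-up and clock granularity. The helper below is a minimal sketch of a slightly more careful measurement, not the code used for the experiment above. RefreshTimer and averageNanos are hypothetical names, and the Runnable is assumed to wrap one refresh call such as IndexReader.openIfChanged(r) or IndexReader.open(w, false).

```java
/** Hypothetical timing helper; not part of Lucene. */
final class RefreshTimer {
    private RefreshTimer() {}

    /**
     * Runs the task warmupRuns times untimed, then measuredRuns times timed,
     * and returns the average wall-clock time per timed run in nanoseconds.
     */
    static long averageNanos(Runnable task, int warmupRuns, int measuredRuns) {
        if (measuredRuns <= 0) {
            throw new IllegalArgumentException("measuredRuns must be positive");
        }
        for (int i = 0; i < warmupRuns; i++) {
            task.run();                  // untimed passes so the JIT can compile the hot path
        }
        long start = System.nanoTime();  // monotonic clock, unlike currentTimeMillis
        for (int i = 0; i < measuredRuns; i++) {
            task.run();
        }
        return (System.nanoTime() - start) / measuredRuns;
    }
}
```

Each approach would be passed in as its own Runnable and measured with the same warm-up and run counts, so that the two averages are comparable.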
1. How NRT works
When you ask for the IndexReader from the IndexWriter, the IndexWriter will be flushed (docs accumulated in RAM will be written to disk) but not committed (fsync files, write new segments file, etc). The returned IndexReader will search over previously committed segments, as well as the new, flushed but not committed segment. Because flushing will likely be processor rather than IO bound, this should be a process that can be attacked with more processor power if found to be too slow.
Also, deletes are carried in RAM, rather than flushed to disk, which may help in eking out a bit more speed. The result is that you can add and remove documents from a Lucene index in ‘near’ real time by continuously asking for a new Reader from the IndexWriter every second or couple of seconds. I haven’t seen a non-synthetic test yet, but it looks like it's been tested at around 50 document updates per second without heavy slowdown (e.g., the results are visible every second).
The patch takes advantage of LUCENE-1483, which keys FieldCaches and Filters at the individual segment level rather than at the index level. This allows you to only reload caches per segment rather than per index, which is essential for real-time search with filter/cache use.
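The pattern of "continuously asking for a new Reader from the IndexWriter every second" can be sketched as a small scheduled refresher. This is an illustration under stated assumptions, not Lucene's implementation: PeriodicRefresher and its methods are hypothetical names, the reader is modeled as a generic value so the sketch runs without Lucene on the classpath, and it uses Java 8 features. With Lucene 3.5 the opener would be something like a call to IndexReader.open(w, false).

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

/** Hypothetical helper; keeps a current "view" and swaps in a fresh one on a schedule. */
final class PeriodicRefresher<T> {
    private final Supplier<T> opener;          // assumed stand-in for IndexReader.open(w, false)
    private final AtomicReference<T> current;  // the view that searches should use
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    PeriodicRefresher(Supplier<T> opener) {
        this.opener = opener;
        this.current = new AtomicReference<>(opener.get());
    }

    /** Returns the most recently opened view. */
    T current() {
        return current.get();
    }

    /** Opens a fresh view, publishes it, and returns the previous one. */
    T refreshNow() {
        return current.getAndSet(opener.get());
    }

    /** Refreshes every periodMillis; if the opener throws, later runs are suppressed. */
    void start(long periodMillis) {
        scheduler.scheduleWithFixedDelay(
                this::refreshNow, periodMillis, periodMillis, TimeUnit.MILLISECONDS);
    }

    /** Stops the background refresh thread. */
    void stop() {
        scheduler.shutdownNow();
    }
}
```

Calling start(1000) would refresh about once a second. The scheduled task discards the previous view returned by refreshNow(); with a real IndexReader that previous reader would still need to be closed once no search is using it, which this sketch leaves out.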
2. NRT efficiency with large data volumes
3. The NRT implementation in Solr
Near realtime search means that documents are available for search almost immediately after being indexed - additions and updates to documents are seen in 'near' realtime.
Near realtime search will be added to Solr in version 4.0 and is currently available on trunk.
You can now modify a commit command to be a 'soft' commit. A soft commit will avoid parts of the standard commit that can be costly. You still will want to do normal commits to ensure that documents are on stable storage, but soft commits allow users to see a very near realtime view of the index in the meantime. Be sure to pay special attention to cache and autowarm settings as they can have a significant impact on NRT performance.
- You can read about soft commits here: http://wiki.apache.org/solr/UpdateXmlMessages#A.22commit.22_and_.22optimize.22
- You can see how to auto soft commit here: http://wiki.apache.org/solr/SolrConfigXml?#Update_Handler_Section
A common configuration might be to 'hard' auto commit every 1-10 minutes and 'soft' auto commit every second. With this configuration, new documents will show up within about a second of being added, and if the power goes out, you will be certain to have a consistent index up to the last 'hard' commit.
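A configuration along those lines might look like the solrconfig.xml fragment below. It is a sketch that assumes the Solr 4.x element names (autoCommit, autoSoftCommit, maxTime in milliseconds, openSearcher). The 5-minute and 1-second values are illustrative picks from the ranges mentioned above, not recommendations from the source.

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- 'hard' commit: flush to stable storage every 5 minutes (assumed value) -->
  <autoCommit>
    <maxTime>300000</maxTime>
    <!-- do not open a new searcher on hard commit; visibility comes from soft commits -->
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- 'soft' commit: make new documents searchable about once a second (assumed value) -->
  <autoSoftCommit>
    <maxTime>1000</maxTime>
  </autoSoftCommit>
</updateHandler>
```

Cache and autowarm settings sit elsewhere in the same file and would need to be tuned separately, since each soft commit opens a new searcher.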