Here are some things to try to speed up searching in your Lucene application. See ImproveIndexingSpeed for how to speed up indexing.
Be sure you really need to speed things up. Many of the ideas here are simple to try, but others will necessarily add some complexity to your application. So be sure your searching speed is indeed too slow and that the slowness is indeed within Lucene.
Make sure you are using the latest version of Lucene.
Use a local filesystem. Remote filesystems are typically quite a bit slower for searching. If the index must be remote, try to mount the remote filesystem as a "readonly" mount. In some cases this can improve performance.
Get faster hardware, especially a faster IO system. Flash-based Solid State Drives work very well for Lucene searches. Because seek times for SSDs are about 100 times faster than for traditional platter-based hard drives, the usual penalty for seeking is virtually eliminated. This means that SSD-equipped machines need less RAM for file caching and that searchers require less warm-up time before they respond quickly.
Tune the OS
One tunable that stands out on Linux is swappiness (http://kerneltrap.org/node/3000), which controls how aggressively the OS will swap out RAM used by processes in favor of the IO cache. Most Linux distros default this to a highish number (meaning aggressive), but this can easily cause horrible search latency, especially if you are searching a large index with a low query rate. Experiment by turning swappiness down or off entirely (by setting it to 0). Windows also has a setting, under My Computer -> Properties -> Advanced -> Performance Settings -> Advanced -> Memory Usage, that lets you favor Programs or System Cache and is likely doing something similar.
(Clotho's note: for example, suppose you have a 1 GB index file that is searched only once a month. If swappiness is set too high, every search loads the whole 1 GB file into memory, after which it is not used again, which wastes both time and memory.)
Open the IndexReader with readOnly=true. This makes a big difference when multiple threads are sharing the same reader, as it removes certain sources of thread contention.
On non-Windows platforms, use NIOFSDirectory instead of FSDirectory.
This also removes sources of contention when accessing the underlying files. Unfortunately, due to a longstanding bug on Windows in Sun's JRE (http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6265734 -- feel particularly free to go vote for it), NIOFSDirectory gets poor performance on Windows.
Add RAM to your hardware and/or increase the heap size for the JVM. For a large index, searching can use a lot of RAM. If you don't have enough RAM, or your JVM is not running with a large enough heap size, then the JVM can hit swapping and thrashing, at which point everything will run slowly.
Use one instance of IndexSearcher.
Share a single IndexSearcher across queries and across threads in your application.
When measuring performance, disregard the first query.
The first query to a searcher pays the price of initializing caches (especially when sorting by fields) and thus will skew your results (assuming you re-use the searcher for many queries). On the other hand, if you re-run the same query again and again, results won't be realistic either, because the operating system will use its cache to speed up IO operations. On Linux (kernel 2.6.16 and later) you can clean the disk cache using "sync ; echo 3 > /proc/sys/vm/drop_caches". See http://linux-mm.org/Drop_Caches for details.
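The measurement advice above can be sketched as a small timing helper. This is only an illustration under stated assumptions: QueryTimer and its method names are hypothetical, and the Supplier stands in for whatever code runs one search against your shared searcher. It uses nothing from Lucene itself.

```java
import java.util.Arrays;
import java.util.function.Supplier;

/** Hypothetical helper: times repeated runs and ignores the first, cache-warming run. */
final class QueryTimer {

    /**
     * Runs the query (runs + 1) times and returns the timings, in nanoseconds,
     * of every run except the first.
     */
    static long[] timeSkippingFirst(Supplier<?> query, int runs) {
        if (runs < 1) {
            throw new IllegalArgumentException("runs must be at least 1");
        }
        long[] timings = new long[runs];
        for (int i = 0; i <= runs; i++) {
            long start = System.nanoTime();
            query.get();
            long elapsed = System.nanoTime() - start;
            if (i > 0) {
                // i == 0 is the warm-up run, so its timing is discarded.
                timings[i - 1] = elapsed;
            }
        }
        return timings;
    }

    /** The median is less sensitive to outliers such as GC pauses than the mean. */
    static long median(long[] timings) {
        long[] sorted = timings.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2;
    }
}
```

To keep the OS cache from flattering the numbers, the supplier would ideally cycle through a varied set of realistic queries rather than repeat a single one.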
Re-open the IndexSearcher only when necessary.
You must re-open the IndexSearcher in order to make newly committed changes visible to searching. However, re-opening the searcher has a certain overhead (noticeable mostly with large indexes and with sorting turned on) and should thus be minimized. Consider using a so-called warming technique, which allows the searcher to warm up its caches before the first query hits.
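One way to combine a single shared searcher with warming is to build and warm the replacement in the background and publish it only once it is ready. The sketch below is an assumption-laden illustration, not Lucene API: SharedSearcher is a hypothetical class, the type parameter S stands in for IndexSearcher, and the opener and warmer callbacks are placeholders for your own code.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;
import java.util.function.Supplier;

/**
 * Hypothetical holder for the one searcher shared by all threads.
 * S is a stand-in for the searcher type.
 */
final class SharedSearcher<S> {
    private final AtomicReference<S> current;

    SharedSearcher(S initial) {
        current = new AtomicReference<>(initial);
    }

    /** Every query thread calls this instead of opening its own searcher. */
    S get() {
        return current.get();
    }

    /**
     * Opens a new searcher, warms it, and only then publishes it.
     * Queries keep hitting the old searcher while the new one warms up.
     * Returns the previous searcher so the caller can close it once
     * in-flight queries that still use it have finished.
     */
    S reopen(Supplier<S> opener, Consumer<S> warmer) {
        S fresh = opener.get();
        warmer.accept(fresh); // e.g. run a few typical sorted queries
        return current.getAndSet(fresh);
    }
}
```

Deciding when the old searcher can safely be closed (for example by reference counting) is left out of this sketch.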
Run optimize on your index before searching (by calling the optimize method of IndexWriter). An optimized index has only one segment to search, which can be much faster than the many segments that will normally be created, especially for a large index. If your application does not often update the index, then it pays to build the index, optimize it, and use the optimized one for searching. If instead you are frequently updating the index and then refreshing searchers, then optimizing will likely be too costly and you should decrease mergeFactor instead.
Decrease mergeFactor. A smaller mergeFactor means fewer segments, so searching will be faster. However, this will slow down indexing, so you should test values to strike an appropriate balance for your application.
Limit usage of stored fields and term vectors. Retrieving these from the index is quite costly. Typically you should only retrieve these for the current "page" the user will see, not for all documents in the full result set. For each document retrieved, Lucene must seek to a different location in various files. Try sorting the documents you need to retrieve by docID order first.
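The docID-ordering tip can be sketched as follows. PageLoader and loadPage are hypothetical names, and the fetch callback is a placeholder for whatever call loads a stored document by its docID; the sketch only shows the ordering logic, under the assumption that fetching in ascending docID order reduces seeking.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.IntFunction;

/** Hypothetical helper that loads one page of results in docID order. */
final class PageLoader {

    /**
     * Fetches the documents for one page in ascending docID order,
     * then returns them in the original ranking order.
     */
    static <D> List<D> loadPage(int[] rankedDocIds, IntFunction<D> fetch) {
        int[] byDocId = rankedDocIds.clone();
        Arrays.sort(byDocId);

        // Fetch sequentially by docID so reads move forward through the files.
        Map<Integer, D> loaded = new HashMap<>();
        for (int docId : byDocId) {
            loaded.put(docId, fetch.apply(docId));
        }

        // Restore the order in which the hits were ranked.
        List<D> page = new ArrayList<>(rankedDocIds.length);
        for (int docId : rankedDocIds) {
            page.add(loaded.get(docId));
        }
        return page;
    }
}
```

Only the docIDs of the current page are passed in, so the cost stays proportional to what the user actually sees.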
Use FieldSelector to carefully pick which fields are loaded, and how they are loaded, when you retrieve a document.
Don't iterate over more hits than needed.
Iterating over all hits is slow for two reasons. Firstly, the search() method that returns a Hits object re-executes the search internally when you need more than 100 hits. Solution: use the search method that takes a HitCollector instead. Secondly, the hits will probably be spread over the disk, so accessing them all requires a lot of I/O activity. This cannot easily be avoided unless the index is small enough to be loaded into RAM. If you don't need the complete documents but only one (small) field, you could also use the FieldCache class to cache that one field and have fast access to it.
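To illustrate the collector idea, here is a minimal sketch of a callback-style collector that keeps only the best N hits and touches no stored documents. It is not Lucene's HitCollector: TopNCollector, Hit, and the collect(docId, score) signature are hypothetical stand-ins chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical collector that remembers only the n highest-scoring hits. */
final class TopNCollector {

    /** A matching document and its score. */
    static final class Hit {
        final int docId;
        final float score;

        Hit(int docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    private final int n;
    // Min-heap on score: the weakest of the kept hits sits at the head.
    private final PriorityQueue<Hit> heap;

    TopNCollector(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        this.n = n;
        this.heap = new PriorityQueue<>(n, (a, b) -> Float.compare(a.score, b.score));
    }

    /** Called once per matching document; does no I/O and keeps at most n hits. */
    void collect(int docId, float score) {
        if (heap.size() < n) {
            heap.add(new Hit(docId, score));
        } else if (score > heap.peek().score) {
            heap.poll();
            heap.add(new Hit(docId, score));
        }
    }

    /** Returns the kept hits, best score first. */
    List<Hit> topHits() {
        List<Hit> hits = new ArrayList<>(heap);
        hits.sort((a, b) -> Float.compare(b.score, a.score));
        return hits;
    }
}
```

Because only docIDs and scores are kept, the expensive document retrieval can be deferred to the handful of hits that are actually displayed.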
When using fuzzy queries, use a minimum prefix length.
Fuzzy queries perform CPU-intensive string comparisons. To avoid comparing every unique term with the user input, examine only terms that start with the same first "N" characters as the input. This prefix length is a property on both QueryParser and FuzzyQuery; the default is zero, so ALL terms are compared.
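The effect of a prefix length can be illustrated with a toy term dictionary. This is a simplified model, not Lucene's implementation: PrefixFuzzyMatcher is a hypothetical class, the sorted set stands in for the index's term dictionary, and a maximum edit count is used here as an assumed stand-in for Lucene's similarity threshold.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;

/** Hypothetical illustration of why a required prefix cuts fuzzy-matching cost. */
final class PrefixFuzzyMatcher {

    /** Terms that share the first prefixLength characters with the input. */
    static NavigableSet<String> candidates(NavigableSet<String> terms, String input, int prefixLength) {
        if (prefixLength <= 0) {
            return terms; // no prefix required: every term must be compared
        }
        String prefix = input.substring(0, Math.min(prefixLength, input.length()));
        // Terms starting with prefix sort between prefix and prefix + '\uffff'
        // (assuming no term contains the character '\uffff').
        return terms.subSet(prefix, true, prefix + Character.MAX_VALUE, false);
    }

    /** Returns the candidate terms within maxEdits edits of the input. */
    static List<String> match(NavigableSet<String> terms, String input, int prefixLength, int maxEdits) {
        List<String> matches = new ArrayList<>();
        for (String term : candidates(terms, input, prefixLength)) {
            if (editDistance(term, input) <= maxEdits) {
                matches.add(term);
            }
        }
        return matches;
    }

    /** Classic dynamic-programming Levenshtein distance using two rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

The trade-off is recall: with a prefix length of 2, a term that differs from the input within its first two characters is never examined, even if it is only one edit away.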
Consider using filters. It can be much more efficient to restrict results to a part of the index using a cached bit set filter rather than using a query clause. This is especially true for restrictions that match a great number of documents in a large index. Filters are typically used to restrict the results to a category, but could in many cases be used to replace any query clause. One difference between using a Query and a Filter is that the Query has an impact on the score while a Filter does not.
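The idea behind a cached bit set filter can be sketched with java.util.BitSet. This is an illustration only, not Lucene's Filter API: CachedBitSetFilters, the string filter key, and the loader callback that computes the allowed docIDs are all hypothetical.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical cache of per-restriction bit sets, one bit per docID. */
final class CachedBitSetFilters {
    private final Map<String, BitSet> cache = new ConcurrentHashMap<>();
    private final Function<String, BitSet> loader;

    /** The loader computes the allowed docIDs for a filter key (the expensive step). */
    CachedBitSetFilters(Function<String, BitSet> loader) {
        this.loader = loader;
    }

    /** Returns the documents in matches that also pass the named filter. */
    BitSet restrict(BitSet matches, String filterKey) {
        // The bit set is computed on first use and reused afterwards.
        BitSet allowed = cache.computeIfAbsent(filterKey, loader);
        BitSet result = (BitSet) matches.clone(); // leave both inputs untouched
        result.and(allowed);
        return result;
    }
}
```

Applying the restriction is a bitwise AND that never touches scores, which mirrors the point above that a filter does not influence ranking. A cached bit set must be discarded when the underlying reader is re-opened, since docIDs may change.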
Find the bottleneck.
Complex query analysis and heavy post-processing of results are examples of hidden bottlenecks for searches. Profiling with a tool such as VisualVM helps locate the problem.
Original article: http://wiki.apache.org/lucene-java/ImproveSearchingSpeed
Another translation, by an expert: http://hi.baidu.com/expertsearch/blog/item/2195a237bfe83d360a55a9fd.html