The problem
Recently, while using Sphinx to search my data, I often failed to retrieve the documents I wanted: the returned list was not the best set of results, and clearly better matches were missing. After some digging, I found the problem in my Sphinx SetLimits call: I had set the $cutoff parameter. (A minimal call sketch follows the parameter list below.)
SetLimits ( $offset, $limit, $max_matches=1000, $cutoff=0)
$offset: starting offset into the server-side result set
$limit: number of matches to return to the client, starting from $offset
$max_matches: maximum size of the server-side result set for the current query
$cutoff: threshold at which the search stops (once $cutoff matching documents have been found, searchd stops searching)
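For context, here is a minimal sketch of how these parameters are passed through the official Sphinx PHP client (sphinxapi.php). The host, port, index name ("test1"), and query string are illustrative assumptions, not from my original setup.

```php
<?php
require_once 'sphinxapi.php';

$cl = new SphinxClient();
$cl->SetServer('localhost', 9312);

// Return 20 matches starting at offset 0, keep at most 1000 matches
// server-side, and leave $cutoff at 0 (no early termination).
$cl->SetLimits(0, 20, 1000, 0);

$result = $cl->Query('keyword', 'test1');
if ($result === false) {
    echo 'Query failed: ' . $cl->GetLastError() . "\n";
} elseif (!empty($result['matches'])) {
    foreach ($result['matches'] as $docId => $info) {
        echo "doc={$docId} weight={$info['weight']}\n";
    }
}
```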
Official documentation
Sets offset into server-side result set ($offset) and amount of matches to return to client starting from that offset ($limit). Can additionally control maximum server-side result set size for current query ($max_matches) and the threshold amount of matches to stop searching at ($cutoff). All parameters must be non-negative integers.
First two parameters to SetLimits() are identical in behavior to MySQL LIMIT clause. They instruct searchd to return at most $limit matches starting from match number $offset. The default offset and limit settings are 0 and 20, that is, to return first 20 matches.
max_matches setting controls how much matches searchd will keep in RAM while searching. All matching documents will be normally processed, ranked, filtered, and sorted even if max_matches is set to 1. But only best N documents are stored in memory at any given moment for performance and RAM usage reasons, and this setting controls that N. Note that there are two places where max_matches limit is enforced. Per-query limit is controlled by this API call, but there also is per-server limit controlled by max_matches setting in the config file. To prevent RAM usage abuse, server will not allow to set per-query limit higher than the per-server limit.
You can't retrieve more than max_matches matches to the client application. The default limit is set to 1000. Normally, you must not have to go over this limit. One thousand records is enough to present to the end user. And if you're thinking about pulling the results to application for further sorting or filtering, that would be much more efficient if performed on Sphinx side.
$cutoff setting is intended for advanced performance control. It tells searchd to forcibly stop search query once $cutoff matches had been found and processed.
offset and limit
There is not much to say about $offset and $limit: like the two arguments of MySQL's LIMIT clause, they define the offset into the result set and the number of rows to fetch, and they are mostly used for pagination.
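As a quick illustration, here is a pagination sketch mapping a page number onto SetLimits(); the variable names $page and $perPage are mine, not from the original post.

```php
<?php
require_once 'sphinxapi.php';

$page    = 3;   // 1-based page number (illustrative)
$perPage = 20;

$cl = new SphinxClient();
$cl->SetServer('localhost', 9312);

// Same semantics as MySQL "LIMIT 40, 20": skip 40 rows, return 20.
$cl->SetLimits(($page - 1) * $perPage, $perPage);
$result = $cl->Query('keyword', 'test1');
```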
max_matches
max_matches is easy to confuse with cutoff. max_matches controls the maximum number of results that can finally be returned. For example, when searching for some string, the best 1000 results are usually enough and anything beyond that is pointless; even Baidu or Google will not return every matching result. max_matches is what controls that value of 1000. How are those 1000 documents obtained? Sphinx looks the search string up in its inverted lists, processes (sorts, filters) every matching document, and only then keeps the top max_matches of them. Even if max_matches is set to 1, Sphinx still processes all matching documents before taking the top max_matches.

This parameter can be set per request, or in the Sphinx configuration file, and the latter takes precedence: the max_matches of a single request cannot exceed the max_matches in the configuration file. (A sketch of this interaction follows.)
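A small sketch of that per-query vs. per-server interaction, under the assumption that sphinx.conf still has the default per-server max_matches of 1000:

```php
<?php
require_once 'sphinxapi.php';

$cl = new SphinxClient();
$cl->SetServer('localhost', 9312);

// Per-query max_matches = 5000, which exceeds the assumed
// per-server limit of 1000 in sphinx.conf.
$cl->SetLimits(0, 20, 5000);

$result = $cl->Query('keyword', 'test1');
if ($result === false) {
    // searchd refuses per-query limits above the per-server limit,
    // so this query fails rather than silently using 5000.
    echo 'Query failed: ' . $cl->GetLastError() . "\n";
}
```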
cutoff
Now look at cutoff. Like max_matches, it also caps the final number of results, but unlike max_matches, it walks the document lists from the beginning, and once cutoff matching documents have been found it simply stops examining the rest; the final processing (sorting, filtering) then happens only within those cutoff documents.
Once cutoff is set, max_matches effectively drops in priority: the documents collected under cutoff become the basis for the processing (sorting, filtering). This is exactly why many better-matching documents never showed up for me: I had set cutoff, and removing it fixed the problem.
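The fix, expressed in code (values other than the final 0 are illustrative): passing 0 for $cutoff disables early termination, so ranking runs over all matching documents instead of just the first ones found in index order.

```php
<?php
require_once 'sphinxapi.php';

$cl = new SphinxClient();
$cl->SetServer('localhost', 9312);

// Before (the bug): only the first 500 matches found were ever
// ranked, so better documents further along were never seen.
// $cl->SetLimits(0, 20, 1000, 500);

// After: rank every match, keep the best 1000, return the top 20.
$cl->SetLimits(0, 20, 1000, 0);
$result = $cl->Query('keyword', 'test1');
```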
So when should cutoff be used? Clearly, cutoff greatly reduces the amount of data that has to be examined and so improves performance, but the price is a large loss in result quality. If you do not need precise results, or the loss is acceptable in practice, cutoff can be used to speed searches up.
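One hedged example of such a case (my own illustration, not from the post): an existence check where only "do at least N documents match?" matters and exact ranking does not.

```php
<?php
require_once 'sphinxapi.php';

$cl = new SphinxClient();
$cl->SetServer('localhost', 9312);

// Stop searching as soon as 100 matches are found; we do not care
// which 100 they are, only that they exist.
$cl->SetLimits(0, 1, 100, 100);

$result = $cl->Query('keyword', 'test1');
if ($result !== false && $result['total_found'] >= 100) {
    echo "at least 100 documents match\n";
}
```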