References:
http://106.186.120.253/preview/search-in-depth.html
https://www.elastic.co/guide/en/elasticsearch/guide/current/search-in-depth.html
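Search in depth makes use of word proximity, partial matching, fuzzy matching, and language awareness.
Understanding how each query contributes to the relevance score _score helps us debug our queries: it ensures that the documents we consider the best matches appear on the first page of results, and it trims the "long tail" of barely relevant results.
Search is not just full-text search: a large part of our data is structured, such as dates and numbers.
Structured search: querying data that has inherent structure, such as dates, times, and numbers. These have a precise format, so logical operations can be applied to them, for example ranges over numbers or times and greater-than/less-than comparisons.
Text can be structured too. Colors, for example, come from a discrete set: red, green, and so on.
The answer to a structured query is always yes or no; there is no concept of relevance or score.
Structured search uses filters: no scoring phase, fast execution, high performance.
term query: handles numbers, booleans, dates, and text.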
POST /my_store/products/_bulk
{ "index": { "_id": 1 }}
{ "price" : 10, "productID" : "XHDK-A-1293-#fJ3" }
{ "index": { "_id": 2 }}
{ "price" : 20, "productID" : "KDKE-B-9947-#kL5" }
{ "index": { "_id": 3 }}
{ "price" : 30, "productID" : "JODL-X-1937-#pV7" }
{ "index": { "_id": 4 }}
{ "price" : 30, "productID" : "QQPX-R-3956-#aD8" }
// Basic version (L1)
{
"query": {
"term": {
"price": 20
}
}
}
// Improved version (L2):
{
"query" : {
"constant_score" : { // run the term query in non-scoring mode, turning it into a filter
"filter" : { // constant_score must wrap a filter
"term" : {
"price" : 20
}
}
}
}
}
// A query placed inside the filter clause performs no scoring or relevance calculation; every result gets the default score of 1.
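When request bodies are built in application code, the constant_score wrapping shown above can be factored into a small helper. The following is a minimal sketch in plain Python; the function name constant_score is a hypothetical helper, not part of any Elasticsearch client, and it only builds the dictionary that would be sent as the request body.

```python
import json


def constant_score(filter_clause):
    """Wrap a filter clause so that it runs in non-scoring mode.

    Hypothetical helper: it only builds the request body as a dict.
    """
    return {"query": {"constant_score": {"filter": filter_clause}}}


# Same request body as the L2 example above.
body = constant_score({"term": {"price": 20}})
print(json.dumps(body, indent=2))
```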
term query on text
// The analyze API shows that the value has been tokenized
GET my_store/_analyze
{
"field": "productID",
"text": "XHDK-A-1293-#fJ3"
}
Result:
{
"tokens" : [ {
"token" : "xhdk",
"start_offset" : 0,
"end_offset" : 4,
"type" : "<ALPHANUM>",
"position" : 1
}
// ...
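After analysis:
1. The field is represented by several tokens instead of a single one
2. All letters have been lowercased
3. The hyphens and the hash sign have been lost
So an exact-value term lookup for the original string returns nothing: that exact term is not in the inverted index!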
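The effect can be reproduced outside Elasticsearch. The sketch below uses a very rough stand-in for the standard analyzer (split on anything that is not an ASCII letter or digit, then lowercase); the real analyzer follows Unicode text segmentation rules, so rough_standard_analyze is an illustrative assumption, not the actual implementation.

```python
import re


def rough_standard_analyze(text):
    """Rough approximation of an analyzer: keep runs of ASCII letters
    and digits as tokens and lowercase them."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]


product_id = "XHDK-A-1293-#fJ3"

# Terms that end up in the inverted index when the field is analyzed.
analyzed_terms = set(rough_standard_analyze(product_id))

# Terms when the field is not analyzed: the whole value is one term.
not_analyzed_terms = {product_id}

# A term query compares the query value with indexed terms verbatim.
hit_on_analyzed_field = product_id in analyzed_terms
hit_on_not_analyzed_field = product_id in not_analyzed_terms

print(sorted(analyzed_terms), hit_on_analyzed_field, hit_on_not_analyzed_field)
```

The lookup only succeeds when the indexed term is the untouched original value, which is what the not_analyzed mapping provides.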
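// The type of an existing field cannot be changed, so we (1) delete the index, (2) recreate it with the right mapping, (3) reindex the data, and (4) rerun the term query.
DELETE my_store
PUT my_store
{
"mappings" : {
"products" : {
"properties" : {
"productID" : {
"type" : "string",
"index" : "not_analyzed" // do not analyze: index the exact value as a single term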
}
}
}
}
}
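A non-scoring query performs several operations internally:
1. Find matching docs. Look up the term in the inverted index and retrieve the list of all documents that contain it.
2. Build a bitset. The filter builds a bitset (an array of 0s and 1s) for each non-scoring query, describing which documents contain the term; matching documents get a 1 bit. For example, if only document 1 out of documents 1 to 4 matches, the bitset is [1,0,0,0]. Internally it is represented as a "roaring bitmap", which can efficiently encode both sparse and dense sets.
3. Iterate over the bitset(s). Once a bitset has been generated for each query, Elasticsearch iterates over the bitsets to find the set of documents that satisfy all filtering criteria. The order of execution is decided heuristically, but generally the sparsest bitset is iterated first, because it excludes the largest number of documents.
4. Increment the usage counter. ES can cache non-scoring queries for faster access, but caching something that is rarely used is a waste of resources. To avoid this, ES tracks a usage history for every index. If a query has been used more than a few times in the last 256 queries, it is cached in memory. Even then, caching is skipped on segments that have fewer than 10,000 documents (or less than 3% of the total index size): these small segments will soon be merged away, so they are not worth a cache entry.
Conceptually, the non-scoring part runs first; keeping this in mind helps in writing efficient, fast search requests.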
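Steps 1 to 3 can be sketched with plain Python lists standing in for bitsets. This is a toy model under stated assumptions: the in_stock field is hypothetical (it is not in the sample data), the "sparsest first" ordering is a simplification of the heuristic, and real bitsets are roaring bitmaps rather than lists.

```python
# Toy inverted index: (field, value) -> ids of the documents containing it.
# Document ids are 1-based, as in the sample data.
inverted_index = {
    ("price", 10): [1],
    ("price", 20): [2],
    ("price", 30): [3, 4],
    ("in_stock", True): [1, 3, 4],  # hypothetical field for illustration
}
NUM_DOCS = 4


def build_bitset(field, value):
    """Steps 1 and 2: look up the postings, then set one bit per match."""
    bits = [0] * NUM_DOCS
    for doc_id in inverted_index.get((field, value), []):
        bits[doc_id - 1] = 1
    return bits


def intersect(bitsets):
    """Step 3: AND the bitsets together, starting with the sparsest one."""
    ordered = sorted(bitsets, key=sum)
    result = ordered[0][:]
    for bits in ordered[1:]:
        if not any(result):
            break  # nothing left to match, stop early
        result = [a & b for a, b in zip(result, bits)]
    return result


def matching_ids(bits):
    """Translate a bitset back into 1-based document ids."""
    return [index + 1 for index, bit in enumerate(bits) if bit]


price_10 = build_bitset("price", 10)
both = intersect([build_bitset("in_stock", True), build_bitset("price", 30)])
print(price_10, matching_ids(both))
```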
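Scenario: filtering on multiple fields or values.
Tool: the bool filter
Pattern: a bool query is composed of four sections
{
"bool" : { // every section is optional; each section holds one query or an array of queries
"must" : [], // = AND: all of these clauses must match
"should" : [], // = OR: at least one of these clauses must match
"must_not" : [], // = NOT: none of these clauses may match
"filter": [] // must match, like AND, but runs in non-scoring, filtering mode; inside constant_score the distinction is moot because everything is already non-scoring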
}
}
GET /my_store/products/_search
{
"query" : {
"constant_score" : { // ES 2.x: constant_score takes the place of the earlier filtered query and runs in non-scoring mode
"filter" : {
"bool" : {
"should" : [ // several clauses go in an array [...]
{ "term" : {"price" : 20}},
{ "term" : {"productID" : "XHDK-A-1293-#fJ3"}}
],
"must_not" : { // a single clause can be a plain object {...}
"term" : {"price" : 30}
}
}
}
}
}
}
GET /my_store/products/_search
{
"query" : {
"constant_score" : {
"filter" : {
"bool" : {
"should" : [ // one of the should clauses is itself a bool query; matching either clause is enough
{ "term" : {"productID" : "KDKE-B-9947-#kL5"}},
{ "bool" : {
"must" : [
{ "term" : {"productID" : "JODL-X-1937-#pV7"}},
{ "term" : {"price" : 30}}
]
}}
]
}
}
}
}
}
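The boolean logic of the two requests above can be checked with a small in-memory evaluator. This is a sketch, not how Elasticsearch executes queries: it handles only term and bool, treats productID as an exact (not_analyzed) value, and assumes filter-context semantics in which a non-empty should section requires at least one match. The names matches and search are hypothetical.

```python
docs = {
    1: {"price": 10, "productID": "XHDK-A-1293-#fJ3"},
    2: {"price": 20, "productID": "KDKE-B-9947-#kL5"},
    3: {"price": 30, "productID": "JODL-X-1937-#pV7"},
    4: {"price": 30, "productID": "QQPX-R-3956-#aD8"},
}


def as_list(clauses):
    """A section may hold a single clause object or an array of clauses."""
    if clauses is None:
        return []
    return clauses if isinstance(clauses, list) else [clauses]


def matches(doc, query):
    """Evaluate a term or bool clause against one document."""
    kind, body = next(iter(query.items()))
    if kind == "term":
        field, value = next(iter(body.items()))
        return doc.get(field) == value
    if kind == "bool":
        must = as_list(body.get("must")) + as_list(body.get("filter"))
        should = as_list(body.get("should"))
        must_not = as_list(body.get("must_not"))
        return (
            all(matches(doc, q) for q in must)
            and (not should or any(matches(doc, q) for q in should))
            and not any(matches(doc, q) for q in must_not)
        )
    raise ValueError(f"unsupported clause: {kind}")


def search(query):
    return [doc_id for doc_id, doc in sorted(docs.items()) if matches(doc, query)]


first = {"bool": {
    "should": [
        {"term": {"price": 20}},
        {"term": {"productID": "XHDK-A-1293-#fJ3"}},
    ],
    "must_not": {"term": {"price": 30}},
}}

second = {"bool": {
    "should": [
        {"term": {"productID": "KDKE-B-9947-#kL5"}},
        {"bool": {"must": [
            {"term": {"productID": "JODL-X-1937-#pV7"}},
            {"term": {"price": 30}},
        ]}},
    ],
}}

print(search(first), search(second))
```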
Ref https://www.elastic.co/guide/en/elasticsearch/reference/2.1/query-dsl-bool-query.html
term  --> a single value.
terms --> multiple values.
Both are "contains" operations (terms: contains any of the values), not "equals". The nature of an inverted index also means that entire-field equality is rather difficult to calculate.
{
"terms" : { // note: "contains any", not "equals"
"price" : [20, 30]
}
}
{
"query" : {
"constant_score" : { // terms goes inside the filter clause
"filter" : {
"terms" : {
"price" : [20, 30]
}
}
}
}
}
Equals exactly: add a field that stores the number of terms
{ "tags" : ["search"], "tag_count" : 1 }
{ "tags" : ["search", "open_source"], "tag_count" : 2 }
....
{
"query": {
"constant_score" : {
"filter" : {
"bool" : {
"must" : [
{ "term" : { "tags" : "search" } },
{ "term" : { "tag_count" : 1 } }
]
}
}
}
}
}
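The difference between "contains" and "equals exactly" is easy to see on the two tag documents above. The sketch below models a term filter as a membership test over the field's values; term_contains is a hypothetical helper used only for illustration.

```python
tagged_docs = [
    {"tags": ["search"], "tag_count": 1},
    {"tags": ["search", "open_source"], "tag_count": 2},
]


def term_contains(doc, field, value):
    """Model of a term filter: true if the field contains the value."""
    values = doc.get(field, [])
    if not isinstance(values, list):
        values = [values]
    return value in values


# term on tags alone matches every document that contains "search".
contains_search = [d for d in tagged_docs if term_contains(d, "tags", "search")]

# Adding the tag_count condition keeps only the document whose
# tags field holds "search" and nothing else.
exactly_search = [
    d for d in tagged_docs
    if term_contains(d, "tags", "search") and term_contains(d, "tag_count", 1)
]

print(len(contains_search), exactly_search)
```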
// range goes under query --> constant_score --> filter
"range" : {
"price" : {
"gte" : 20,
"lt" : 40
}
}
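The semantics of the range operators can be sketched in a few lines of Python. Date fields take the same operators, with bounds written in date math, either relative to now (now-1h) or anchored to a date followed by || and an offset (+1M). The helpers in_range and add_months are hypothetical, and add_months is deliberately naive: it assumes the day of the month also exists in the target month.

```python
from datetime import datetime, timedelta


def in_range(value, gte=None, gt=None, lte=None, lt=None):
    """True if value satisfies every bound that was supplied."""
    return (
        (gte is None or value >= gte)
        and (gt is None or value > gt)
        and (lte is None or value <= lte)
        and (lt is None or value < lt)
    )


# Prices of the four sample products; gte 20 and lt 40 as in the range above.
prices = [10, 20, 30, 30]
in_price_range = [p for p in prices if in_range(p, gte=20, lt=40)]


def add_months(dt, months):
    """Naive month arithmetic for illustration only."""
    index = dt.month - 1 + months
    return dt.replace(year=dt.year + index // 12, month=index % 12 + 1)


# "now-1h": one hour before the current time.
one_hour_ago = datetime.now() - timedelta(hours=1)

# "2014-01-01 00:00:00||+1M": the anchor date plus one month.
anchor = datetime.strptime("2014-01-01 00:00:00", "%Y-%m-%d %H:%M:%S")
upper_bound = add_months(anchor, 1)

print(in_price_range, upper_bound)
```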
// date fields support date math:
"gt" : "now-1h" // relative to the current time
"lt" : "2014-01-01 00:00:00||+1M" // anchored to an absolute date, plus one month
String ranges
Strings are compared in lexicographical (alphabetical) order.
Terms in the inverted index are sorted in lexicographical order, which is why string ranges use this order.
Be careful of cardinality:
Numeric and date fields are indexed in such a way that ranges are efficient to calculate.
For a range on a string field, ES is effectively performing a term filter for every term that falls in the range. This is much slower than a date or numeric range. String ranges are fine on a field with low cardinality, that is, a small number of unique terms.
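Why cardinality matters can be sketched with a sorted list standing in for the term dictionary. The term list below is made up for illustration; the point is that a string range expands into every term between the two bounds, so the work grows with the number of unique terms in that range.

```python
from bisect import bisect_left

# Illustrative term dictionary, kept in lexicographical order
# (Python compares str values by code point, which is enough for ASCII).
terms = sorted(["5", "50", "6", "B", "C", "a", "ab", "abb", "abc", "b"])


def terms_in_range(sorted_terms, gte, lt):
    """Return every term t with gte <= t < lt."""
    low = bisect_left(sorted_terms, gte)
    high = bisect_left(sorted_terms, lt)
    return sorted_terms[low:high]


# A string range from "a" (inclusive) to "b" (exclusive) expands to
# one term lookup per term in this list.
expanded = terms_in_range(terms, "a", "b")
print(terms, expanded)
```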
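// exists and missing also go under query --> constant_score --> filter
exists: matches documents that have any value in the specified field. "exists" : { "field" : "tags" }
A field that holds only a null, or no value at all, is excluded; a field that holds a null together with other values is not excluded.
missing: the inverse, matches documents that have no value in the specified field. "missing" : { "field" : "tags" }
Distinguishing an explicit null from a field that has no value at all:
for a string, numeric, Boolean, or date field, you can also set a null_value that will be used whenever an explicit null value is encountered.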
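The exists and missing rules described above can be written down as two small predicates. This is a sketch of the semantics only, with Python None standing in for JSON null; the null_value mapping option is not modeled here.

```python
def exists(doc, field):
    """True if the field holds at least one non-null value."""
    value = doc.get(field)
    values = value if isinstance(value, list) else [value]
    return any(v is not None for v in values)


def missing(doc, field):
    """Inverse of exists: the field has no non-null value."""
    return not exists(doc, field)


samples = [
    {"tags": ["search"]},        # a real value
    {"tags": ["search", None]},  # a null next to a real value
    {"tags": None},              # explicit null
    {"tags": [None]},            # array holding only null
    {"tags": []},                # empty array
    {"title": "no tags here"},   # field absent
]

exists_flags = [exists(doc, "tags") for doc in samples]
print(exists_flags)
```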
exists/missing on Objects
The exists and missing queries also work on inner objects, not just core types.
{
"name" : {
"first" : "John",
"last" : "Smith"
}
}
Internally this is flattened to:
{
"name.first" : "John",
"name.last" : "Smith"
}
The query:
{
"exists" : { "field" : "name" }
}
is effectively executed as:
{
"bool": {
"should": [
{ "exists": { "field": "name.first" }},
{ "exists": { "field": "name.last" }}
]
}
}
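The flattening and the resulting exists check on an inner object can be sketched as follows. Both helper names are hypothetical, and the model covers only nested objects with scalar leaves.

```python
def flatten(obj, prefix=""):
    """Flatten nested objects into dotted field paths."""
    flat = {}
    for key, value in obj.items():
        path = prefix + key
        if isinstance(value, dict):
            flat.update(flatten(value, path + "."))
        else:
            flat[path] = value
    return flat


def exists(doc, field):
    """True if the field itself, or any of its sub-fields, is non-null."""
    return any(
        value is not None
        for path, value in flatten(doc).items()
        if path == field or path.startswith(field + ".")
    )


person = {"name": {"first": "John", "last": "Smith"}}
flat_person = flatten(person)
print(flat_person, exists(person, "name"))
```

An exists check on name therefore succeeds as soon as either name.first or name.last has a value, matching the should pair above.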
At the heart of filter caching is a bitset representing which documents match the filter.
Once cached, these bitsets can be reused wherever the same query is used, without having to reevaluate the entire query again.
Bitsets are “smart”: they are updated incrementally.
Independent Query Caching
Once cached, a query can be reused in multiple search requests.
It is not dependent on the “context” of the surrounding query. This allows caching to accelerate the most frequently used portions of your queries, without wasting overhead on the less frequent or more volatile portions.
GET /inbox/emails/_search
{
"query": {
"constant_score": {
"filter": {
"bool": {
"should": [
{ "bool": {
"must": [
{ "term": { "folder": "inbox" }}, //1
{ "term": { "read": false }}
]
}},
{ "bool": {
"must_not": {
"term": { "folder": "inbox" } // 2: clauses 1 and 2 sit in must and must_not respectively, but they are the same query and reuse one bitset
},
"must": {
"term": { "important": true }
}
}}
]
}
}
}
}
}
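The reuse of one bitset by both branches can be sketched with a dictionary keyed by the serialized clause. This is a toy model: the three emails are made-up data, the cache is unconditional (the real cache is per segment and subject to the usage policy), and term_bitset is a hypothetical helper.

```python
import json

# Made-up sample emails for illustration.
emails = [
    {"folder": "inbox", "read": False, "important": False},
    {"folder": "inbox", "read": True, "important": True},
    {"folder": "archive", "read": True, "important": True},
]

bitset_cache = {}
evaluations = []  # records every clause that actually had to be evaluated


def term_bitset(field, value):
    """Return the bitset for a term clause, computing it at most once."""
    key = json.dumps({"term": {field: value}}, sort_keys=True)
    if key not in bitset_cache:
        evaluations.append(key)
        bitset_cache[key] = [int(doc.get(field) == value) for doc in emails]
    return bitset_cache[key]


# Branch 1: must folder=inbox AND must read=false.
inbox = term_bitset("folder", "inbox")
unread = term_bitset("read", False)
unread_in_inbox = [a & b for a, b in zip(inbox, unread)]

# Branch 2: must_not folder=inbox AND must important=true.
# The folder=inbox bitset is fetched from the cache and negated.
inbox_again = term_bitset("folder", "inbox")
important = term_bitset("important", True)
important_elsewhere = [(1 - a) & b for a, b in zip(inbox_again, important)]

# should: a document matches if either branch matches.
result = [a | b for a, b in zip(unread_in_inbox, important_elsewhere)]
print(result, len(evaluations))
```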
Autocaching Behavior
Even inside a filter, a query is not necessarily cached.
Earlier ES versions: cache everything that was cacheable.
Many filters are very fast to evaluate, but substantially slower to cache (and reuse from cache). These filters don’t make sense to cache, since you’d be better off just re-executing the filter again.
Inspecting the inverted index is very fast and most query components are rare.
Consider a term filter on a “user_id” field: if you have millions of users, any particular user ID will only occur rarely.
Elasticsearch caches queries automatically based on usage frequency. If a non-scoring query has been used a few times (dependent on the query type) in the last 256 queries, the query is a candidate for caching. However, not all segments are guaranteed to cache the bitset. Only segments that hold more than 10,000 documents (or 3% of the total documents, whichever is larger) will cache the bitset. Because small segments are fast to search and merged out quickly, it doesn’t make sense to cache bitsets here.
Once cached, a non-scoring bitset will remain in the cache until it is evicted. Eviction is done on an LRU basis.
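The caching policy described above can be sketched as two checks: one on usage history and one on segment size. The numbers 256, 10,000 and 3% come from the text; the min_uses threshold is an assumption standing in for "a few times", which the text says depends on the query type. Class and function names are hypothetical, and LRU eviction is not modeled.

```python
from collections import Counter, deque

HISTORY_SIZE = 256          # how many recent queries are remembered
MIN_SEGMENT_DOCS = 10_000   # absolute segment size threshold
MIN_SEGMENT_RATIO = 0.03    # relative threshold: 3% of all documents


class UsageTracker:
    """Remembers the most recent queries of one index."""

    def __init__(self, min_uses=2):
        # min_uses is an assumed value for "a few times".
        self.history = deque(maxlen=HISTORY_SIZE)
        self.min_uses = min_uses

    def record(self, query_key):
        self.history.append(query_key)

    def is_cache_candidate(self, query_key):
        return Counter(self.history)[query_key] >= self.min_uses


def segment_caches_bitset(segment_docs, total_docs):
    """A segment caches bitsets only if it exceeds the larger threshold."""
    threshold = max(MIN_SEGMENT_DOCS, MIN_SEGMENT_RATIO * total_docs)
    return segment_docs > threshold


tracker = UsageTracker()
tracker.record("term:folder=inbox")
tracker.record("term:user_id=42")
tracker.record("term:folder=inbox")

print(tracker.is_cache_candidate("term:folder=inbox"),
      tracker.is_cache_candidate("term:user_id=42"),
      segment_caches_bitset(20_000, 100_000))
```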