Elasticsearch

General capabilities

Count documents

GET http://192.168.1.243:9200/_count

{
    "query": {
        "match_all": {}
    }
}

<<

{
    "count": 272053,
    "_shards": {
        "total": 2,
        "successful": 2,
        "skipped": 0,
        "failed": 0
    }
}

Cluster health

GET /_cluster/health

<<

{
   "cluster_name":          "elasticsearch",
   "status":                "green", 
   "timed_out":             false,
   "number_of_nodes":       1,
   "number_of_data_nodes":  1,
   "active_primary_shards": 0,
   "active_shards":         0,
   "relocating_shards":     0,
   "initializing_shards":   0,
   "unassigned_shards":     0
}

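The status field indicates whether the cluster as a whole is working properly. Its three colors mean:

green
All primary and replica shards are active.
yellow
All primary shards are active, but not all replica shards are.
red
At least one primary shard is not active.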
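The three colors can be read as a simple decision rule. The sketch below is only an illustration of that rule; the function name and its two boolean inputs are assumptions made here, not part of any Elasticsearch API.

```python
def cluster_status(primaries_active: bool, replicas_active: bool) -> str:
    """Return the health color implied by the shard state (illustrative only)."""
    if not primaries_active:
        return "red"      # at least one primary shard is not active
    if not replicas_active:
        return "yellow"   # all primaries are active, some replica is not
    return "green"        # every primary and replica shard is active


print(cluster_status(True, True))
print(cluster_status(True, False))
print(cluster_status(False, False))
```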

Creating and retrieving data

Create a document

PUT http://192.168.1.243:9200/megacorp/employee/1
------------------------------index----type-----ID

{
    "first_name" : "John",
    "last_name" :  "Smith",
    "age" :        25,
    "about" :      "I love to go rock climbing",
    "interests": [ "sports", "music" ]
}

<<

{
    "_index": "megacorp",
    "_type": "employee",
    "_id": "1",
    "_version": 1,
    "result": "created",
    "_shards": {
        "total": 2,
        "successful": 1,
        "failed": 0
    },
    "_seq_no": 0,
    "_primary_term": 1
}

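To have ES generate the ID automatically, use POST instead of PUT and omit the ID from the URL. Auto-generated IDs are URL-safe, Base64-encoded GUID strings that are 20 characters long. They are produced with a modified FlakeID scheme, which lets multiple nodes generate unique IDs in parallel with an essentially zero chance of collision.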
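To get a feel for why such an ID is 20 characters long and URL-safe, here is a minimal sketch. It is not the FlakeID scheme; it only assumes 15 random bytes as input, which happen to encode to exactly 20 URL-safe Base64 characters with no padding. The function name make_id is hypothetical.

```python
import base64
import os


def make_id() -> str:
    """Illustrative only: 15 random bytes encode to exactly 20 Base64 characters."""
    raw = os.urandom(15)  # 15 bytes * 8 bits / 6 bits per character = 20 characters
    return base64.urlsafe_b64encode(raw).decode("ascii")


doc_id = make_id()
print(doc_id, len(doc_id))
```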

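Use the HEAD method to check whether a document exists.

Update a document (a PUT to an existing ID replaces the whole document)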

PUT /website/blog/123
{
  "title": "My first blog entry",
  "text":  "I am starting to get the hang of this...",
  "date":  "2014/01/02"
}

Delete a document

DELETE /website/blog/123

Get a document by ID

GET http://192.168.1.243:9200/megacorp/employee/1

<<

{
    "_index": "megacorp",
    "_type": "employee",
    "_id": "1",
    "_version": 1,
    "_seq_no": 0,
    "_primary_term": 1,
    "found": true,
    "_source": {
        "first_name": "John",
        "last_name": "Smith",
        "age": 25,
        "about": "I love to go rock climbing",
        "interests": [
            "sports",
            "music"
        ]
    }
}

Return part of a document
GET http://192.168.1.243:9200/megacorp/employee/1?_source=first_name,age

Return only _source, without the metadata

GET /website/blog/123/_source

Search

GET http://192.168.1.243:9200/precisiongenes_search_engine/_search

<<

{
    "took": 1,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 10000,
            "relation": "gte"
        },
        "max_score": 1.0,
        "hits": [
            {
                "_index": "precisiongenes_search_engine",
                "_type": "_doc",
                "_id": "bd3ad591114e491938b8a7ff8f8932e3",
                "_score": 1.0,
                "_source": {
                    "Chr": "11",
                    "Start": "119216858",
                    "End": "119216858",
                    "Ref": "G",
                    "Alt": "A",
                    "HGMD_acc": "CM1926607",
                    "HGMD_Disease": "Nanophthalmos",
                    "HGMD_tag": "DM",
                    "HGMD_rankscore": "0.99",
                    "HGMD_Base": "CGA-TGA",
                    "HGMD_HGVS": "MFRP:NM_031433.4:c.169C>T:NP_113621.1:p.R57*",
                    "HGMD_Mutation_URL": "CGA-TGA|Arg57Term|c.169C>T|- (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=31589614&dopt=Abstract)",
                    "HGMD_PmidAll": "NULL\n"
                }
            }
        ]
    }
}

Search with a condition

GET http://192.168.1.243:9200/precisiongenes_search_engine/_search?q=HGMD_Mutation_URL:CGA-TGA

<<

{
    "took": 25,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 10000,
            "relation": "gte"
        },
        "max_score": 6.579259,
        "hits": [
            {
                "_index": "precisiongenes_search_engine",
                "_type": "_doc",
                "_id": "bd3ad591114e491938b8a7ff8f8932e3",
                "_score": 6.579259,
                "_source": {
                    "Chr": "11",
                    "Start": "119216858",
                    "End": "119216858",
                    "Ref": "G",
                    "Alt": "A",
                    "HGMD_acc": "CM1926607",
                    "HGMD_Disease": "Nanophthalmos",
                    "HGMD_tag": "DM",
                    "HGMD_rankscore": "0.99",
                    "HGMD_Base": "CGA-TGA",
                    "HGMD_HGVS": "MFRP:NM_031433.4:c.169C>T:NP_113621.1:p.R57*",
                    "HGMD_Mutation_URL": "CGA-TGA|Arg57Term|c.169C>T|- (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=31589614&dopt=Abstract)",
                    "HGMD_PmidAll": "NULL\n"
                }
            }
        ]
    }
}

Query DSL (query expressions)

GET http://192.168.1.243:9200/precisiongenes_search_engine/_search
BODY
{
    "query" : {
        "match" : {
            "last_name" : "Smith"
        }
    }
}

<<

{
    "took": 7,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 10000,
            "relation": "gte"
        },
        "max_score": 6.579259,
        "hits": [
            {
                "_index": "precisiongenes_search_engine",
                "_type": "_doc",
                "_id": "bd3ad591114e491938b8a7ff8f8932e3",
                "_score": 6.579259,
                "_source": {
                    "Chr": "11",
                    "Start": "119216858",
                    "End": "119216858",
                    "Ref": "G",
                    "Alt": "A",
                    "HGMD_acc": "CM1926607",
                    "HGMD_Disease": "Nanophthalmos",
                    "HGMD_tag": "DM",
                    "HGMD_rankscore": "0.99",
                    "HGMD_Base": "CGA-TGA",
                    "HGMD_HGVS": "MFRP:NM_031433.4:c.169C>T:NP_113621.1:p.R57*",
                    "HGMD_Mutation_URL": "CGA-TGA|Arg57Term|c.169C>T|- (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=31589614&dopt=Abstract)",
                    "HGMD_PmidAll": "NULL\n"
                }
            }
        ]
    }
}

More complex search

GET /megacorp/employee/_search
{
    "query" : {
        "bool": {
            "must": {
                "match" : {
                    "last_name" : "smith" 
                }
            },
            "filter": {
                "range" : {
                    "age" : { "gt" : 24 } 
                }
            }
        }
    }
}

<<

{
    "took": 1,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 0.2876821,
        "hits": [
            {
                "_index": "megacorp",
                "_type": "employee",
                "_id": "1",
                "_score": 0.2876821,
                "_source": {
                    "first_name": "John",
                    "last_name": "Smith",
                    "age": 25,
                    "about": "I love to go rock climbing",
                    "interests": [
                        "sports",
                        "music"
                    ]
                }
            }
        ]
    }
}

Phrase search

GET /megacorp/employee/_search
{
    "query" : {
        "match_phrase" : {
            "about" : "rock climbing"
        }
    }
}

<<

{
   ...
   "hits": {
      "total":      1,
      "max_score":  0.23013961,
      "hits": [
         {
            ...
            "_score":         0.23013961,
            "_source": {
               "first_name":  "John",
               "last_name":   "Smith",
               "age":         25,
               "about":       "I love to go rock climbing",
               "interests": [ "sports", "music" ]
            }
         }
      ]
   }
}

Highlighted search

GET /megacorp/employee/_search
{
    "query" : {
        "match_phrase" : {
            "about" : "rock climbing"
        }
    },
    "highlight": {
        "fields" : {
            "about" : {}
        }
    }
}

<<

{
   ...
   "hits": {
      "total":      1,
      "max_score":  0.23013961,
      "hits": [
         {
            ...
            "_score":         0.23013961,
            "_source": {
               "first_name":  "John",
               "last_name":   "Smith",
               "age":         25,
               "about":       "I love to go rock climbing",
               "interests": [ "sports", "music" ]
            },
            "highlight": {
               "about": [
                  "I love to go <em>rock</em> <em>climbing</em>"
               ]
            }
         }
      ]
   }
}

Aggregations

GET /megacorp/employee/_search
{
// ------- the query part is optional -------
  "query": {
    "match": {
      "last_name": "smith"
    }
  },
// ------------------------------
  "aggs": {
    "all_interests": {
      "terms": { "field": "interests" }
    }
  }
}

<<

{
   ...
   "hits": { ... },
   "aggregations": {
      "all_interests": {
         "buckets": [
            {
               "key":       "music",
               "doc_count": 2
            },
            {
               "key":       "forestry",
               "doc_count": 1
            },
            {
               "key":       "sports",
               "doc_count": 1
            }
         ]
      }
   }
}

Aggregations: hierarchical rollups

Find the average age of the employees who share each interest

GET /megacorp/employee/_search
{
    "aggs" : {
        "all_interests" : {
            "terms" : { "field" : "interests" },
            "aggs" : {
                "avg_age" : {
                    "avg" : { "field" : "age" }
                }
            }
        }
    }
}

<<

  ...
  "all_interests": {
     "buckets": [
        {
           "key": "music",
           "doc_count": 2,
           "avg_age": {
              "value": 28.5
           }
        },
        {
           "key": "forestry",
           "doc_count": 1,
           "avg_age": {
              "value": 35
           }
        },
        {
           "key": "sports",
           "doc_count": 1,
           "avg_age": {
              "value": 25
           }
        }
     ]
  }
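The nested aggregation above can be emulated in memory to see where the numbers come from. This is a sketch under assumptions: only John Smith appears in these notes, so the other two employee records are assumed sample data chosen to reproduce the bucket counts and averages shown in the response.

```python
from collections import defaultdict

# Only John Smith appears in these notes; the other two records are assumed
# sample data so that the buckets match the response above.
employees = [
    {"first_name": "John", "age": 25, "interests": ["sports", "music"]},
    {"first_name": "Jane", "age": 32, "interests": ["music"]},
    {"first_name": "Douglas", "age": 35, "interests": ["forestry"]},
]


def avg_age_by_interest(docs):
    """Bucket documents by each value of `interests`, then average `age` per bucket."""
    ages = defaultdict(list)
    for doc in docs:
        for interest in doc["interests"]:  # a document lands in one bucket per value
            ages[interest].append(doc["age"])
    buckets = [
        {"key": key, "doc_count": len(vals), "avg_age": {"value": sum(vals) / len(vals)}}
        for key, vals in ages.items()
    ]
    # Order by doc_count descending, then by key, as in the response above.
    return sorted(buckets, key=lambda b: (-b["doc_count"], b["key"]))


for bucket in avg_age_by_interest(employees):
    print(bucket)
```

The outer loop plays the role of the terms aggregation and the average inside each bucket plays the role of the nested avg aggregation.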

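The killer feature: search

Empty search

GET /_search
GET /_search?timeout=10ms

Multiple indices, multiple types

/_search
Search all types in all indices
/gb/_search
Search all types in the gb index
/gb,us/_search
Search all documents in the gb and us indices
/g*,u*/_search
Search all types in any index whose name begins with g or u
/gb/user/_search
Search type user in the gb index
/gb,us/user,tweet/_search
Search types user and tweet in the gb and us indices
/_all/user,tweet/_search
Search types user and tweet in all indices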

Pagination

GET /_search?size=5
GET /_search?size=5&from=5
GET /_search?size=5&from=10
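The three requests above are pages 1, 2 and 3 with five results per page. A small helper makes the arithmetic explicit; the function name and the 1-based page numbering are assumptions made for this sketch.

```python
def page_params(page: int, size: int = 5) -> dict:
    """Translate a 1-based page number into the from/size pair used above."""
    if page < 1 or size < 1:
        raise ValueError("page and size must be positive")
    return {"from": (page - 1) * size, "size": size}


for page in (1, 2, 3):
    print(page, page_params(page))
```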

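Search Lite

Find all documents of type tweet whose tweet field contains the word elasticsearch:
GET /_all/tweet/_search?q=tweet:elasticsearch

Find documents that contain john in the name field and mary in the tweet field:
+name:john +tweet:mary
The query-string parameter, however, must be percent-encoded:
GET /_search?q=%2Bname%3Ajohn+%2Btweet%3Amary

The + prefix marks a condition that must match. Similarly, the - prefix marks a condition that must not match. All conditions without a + or - are optional: the more of them match, the more relevant the document.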
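The percent-encoded form does not have to be written by hand. As a minimal sketch, Python's standard urllib.parse.quote_plus produces the same string for the name/tweet query above:

```python
from urllib.parse import quote_plus

query = "+name:john +tweet:mary"
encoded = quote_plus(query)  # '+' becomes %2B, ':' becomes %3A, space becomes '+'
print("GET /_search?q=" + encoded)
```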

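The _all field

GET /_search?q=mary
When a document is indexed, Elasticsearch concatenates the values of all its fields into one big string and indexes it as the _all field. A query string with no field name, like the one above, searches _all.

+name:(mary john) +date:>2014-09-10 +(aggregations geo)

The name field contains mary or john
The date is later than 2014-09-10
The _all field contains aggregations or geo

Percent-encoded, this becomes:
?q=%2Bname%3A(mary+john)+%2Bdate%3A%3E2014-09-10+%2B(aggregations+geo)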

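Mapping and analysis

Elasticsearch generates a mapping for us dynamically by guessing the type of each field. The _all field is a default field, so the mapping does not mention it, but we know that _all is of type string.

Elasticsearch supports the following simple field types:

String: string
Whole number: byte, short, integer, long
Floating point: float, double
Boolean: boolean
Date: date

This means that if you index a number in quotes ("123"), it is mapped as type string, not long. However, if the field is already mapped as long, Elasticsearch tries to convert the string to a long and throws an exception if it cannot.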
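The guessing rule can be sketched as a small function. This is a rough emulation, not Elasticsearch's actual logic: it uses the older type names from these notes and assumes that only strings in the form YYYY-MM-DD count as dates.

```python
from datetime import datetime


def guess_field_type(value) -> str:
    """Rough emulation of dynamic mapping for a single JSON value."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        try:
            datetime.strptime(value, "%Y-%m-%d")  # assumption: only this date format
            return "date"
        except ValueError:
            return "string"
    raise TypeError(f"unsupported value: {value!r}")


for sample in ("123", 123, 1.5, True, "2014-09-15"):
    print(repr(sample), "->", guess_field_type(sample))
```

Note how the quoted "123" is guessed as string while the bare 123 is guessed as long, which is the pitfall described above.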

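View the mapping: GET http://36.152.126.130:9200/megacorp/_mapping/

Custom field mappings (skipped)

index

The index attribute controls how a string is indexed. It takes one of three values:

analyzed
Analyze the string first, then index it. In other words, index this field as full text.
not_analyzed
Index this field so that it is searchable, but index the exact value as given. It is not analyzed.
no
Do not index this field at all. It will not be searchable.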

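Queries and filters

Filter context: the query runs as a non-scoring ("filtering") query. It only checks inclusion or exclusion, which makes it very fast to compute, and the results are cached in memory for quick reuse.
Query context: the query runs as a scoring query. It must not only find the matching documents but also calculate how relevant each one is, which makes it considerably more expensive than a non-scoring query. Query results are also not cached.

Clauses

The match_all query simply matches all documents. It is the default query when none is specified.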

{ "match_all": {}}

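match: if you run a match query on a full-text field, it analyzes the query string with the correct analyzer before executing the search. If you use it on a field that holds an exact value, such as a number, a date, a Boolean, or a not_analyzed string field, it searches for that exact value.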

{ "match": { "tweet": "About Search" }}

{ "match": { "age":    26           }}
{ "match": { "date":   "2014-09-01" }}
{ "match": { "public": true         }}
{ "match": { "tag":    "full_text"  }}

For exact-value lookups, you will probably want to use a filter clause instead of a query, because filters are cached.

The multi_match query runs the same match query on multiple fields.

{
    "multi_match": {
        "query":    "full text search",
        "fields":   [ "title", "body" ]
    }
}

The range query finds numbers or dates that fall within a specified range.

{
    "range": {
        "age": {
            "gte":  20,
            "lt":   30
        }
    }
}

The term query is used to match exact values, which may be numbers, dates, Booleans, or not_analyzed strings.

{ "term": { "age":    26           }}
{ "term": { "date":   "2014-09-01" }}
{ "term": { "public": true         }}
{ "term": { "tag":    "full_text"  }}

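By default, strings are analyzed when they are indexed, so some key characters are lost and exact-value lookups do not work. Moreover, once a field has been indexed as analyzed, its mapping cannot be changed to not_analyzed.

To see the result of analysis: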

GET /my_store/_analyze
{
  "field": "productID",
  "text": "XHDK-A-1293-#fJ3"
}

<<

{
  "tokens" : [ {
    "token" :        "xhdk",
    "start_offset" : 0,
    "end_offset" :   4,
    "type" :         "<ALPHANUM>",
    "position" :     1
  }, {
    "token" :        "a",
    "start_offset" : 5,
    "end_offset" :   6,
    "type" :         "<ALPHANUM>",
    "position" :     2
  }, {
    "token" :        "1293",
    "start_offset" : 7,
    "end_offset" :   11,
    "type" :         "<NUM>",
    "position" :     3
  }, {
    "token" :        "fj3",
    "start_offset" : 13,
    "end_offset" :   16,
    "type" :         "<ALPHANUM>",
    "position" :     4
  } ]
}

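Elasticsearch represents this UPC with 4 separate tokens rather than a single token.
All letters have been lowercased.
The hyphens and the hash sign (#) have been lost.

So when a term query looks for the exact value XHDK-A-1293-#fJ3, it finds no documents, because that value is not in the inverted index. As the analysis result above shows, the index holds the four tokens instead.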
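The effect can be reproduced with a very rough stand-in for the analyzer. The analyze function below is an assumption made for illustration (lowercase, then split on anything that is not a letter or digit); the real analyzer is more elaborate, but for this value it yields the same four tokens.

```python
import re


def analyze(text: str) -> list:
    """Very rough stand-in for an analyzer: lowercase, split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())


inverted_index_terms = set(analyze("XHDK-A-1293-#fJ3"))
print(sorted(inverted_index_terms))

# A term query looks the value up as-is, without analyzing it first.
print("XHDK-A-1293-#fJ3" in inverted_index_terms)  # the exact value is not a term
print("xhdk" in inverted_index_terms)              # an individual token is
```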

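At this point the only option is to delete the index, recreate it with the right mapping, and reindex the data so that exact-value lookups work.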

DELETE /my_store 

PUT /my_store 
{
    "mappings" : {
        "products" : {
            "properties" : {
                "productID" : {
                    "type" : "string",
                    "index" : "not_analyzed" 
                }
            }
        }
    }

}

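The terms query is the same as the term query, but it lets you specify multiple values to match. **Be aware that term and terms are "contains" operations, not "equals" checks.** If you need exact equality, the workaround is to index an additional auxiliary field.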

{ "terms": { "tag": [ "search", "full_text", "nosql" ] }}

{
    "terms" : {
        "price" : [20, 30]
    }
}
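The difference between "contains" and "equals" is easy to see on a multi-value field. The sketch below emulates it in memory; the sample documents are made up, and tag_count is a hypothetical name for the auxiliary field that stores how many values the field holds.

```python
docs = [
    {"id": 1, "tags": ["search"]},
    {"id": 2, "tags": ["search", "open_source"]},
]


def term_match(doc, field, value):
    """`term` semantics: true when the field's values contain the value."""
    return value in doc[field]


contains = [d["id"] for d in docs if term_match(d, "tags", "search")]

# For exact equality, store the number of values in an auxiliary field
# (`tag_count` is a hypothetical name) and filter on both conditions.
for d in docs:
    d["tag_count"] = len(d["tags"])

equals = [
    d["id"] for d in docs
    if term_match(d, "tags", "search") and d["tag_count"] == 1
]
print(contains, equals)
```

Both documents contain the term search, but only the first one has search as its sole tag.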

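The exists and missing queries find documents in which the specified field has a value (exists) or has no value (missing). They are similar in nature to IS NULL (missing) and IS NOT NULL (exists) in SQL. They are commonly used to handle the case where a field is present separately from the case where it is absent.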

{
    "exists":   {
        "field":    "title"
    }
}

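Combining multiple queries
- must
  
  Documents must match these clauses to be included.
- must_not
  
  Documents must not match these clauses to be included.
- should
  
  If any of these clauses match, they increase _score; otherwise they have no effect. They are mainly used to refine the relevance score of each document.
- filter
  
  Must match, but runs in non-scoring, filtering mode. These clauses do not contribute to the score; they only include or exclude documents based on the filter criteria.


Compound clauses are mainly used to combine other query clauses. For example, a bool clause lets you combine other clauses as needed, whether must, must_not, or should, and it can also contain non-scoring filters.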

{
    "bool": {
        "must":     { "match": { "tweet": "elasticsearch" }},
        "must_not": { "match": { "name":  "mary" }},
        "should":   { "match": { "tweet": "full text" }},
        "filter":   { "range": { "age" : { "gt" : 30 }} }
    }
}
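How the four clause types interact can be sketched with a tiny in-memory evaluator. Everything here is an assumption made for illustration: clauses are plain Python predicates, substring checks stand in for match, and the score is simplified to one point per matching must or should clause.

```python
def bool_query(doc, must=(), must_not=(), should=(), filters=()):
    """Return a simplified score for `doc`, or None when the document is excluded."""
    if not all(clause(doc) for clause in must):
        return None
    if not all(clause(doc) for clause in filters):
        return None
    if any(clause(doc) for clause in must_not):
        return None
    # must and should contribute to the score; filters and must_not do not.
    return len(must) + sum(1 for clause in should if clause(doc))


docs = [
    {"tweet": "elasticsearch full text", "name": "john", "age": 35},
    {"tweet": "elasticsearch basics", "name": "john", "age": 40},
    {"tweet": "elasticsearch full text", "name": "mary", "age": 50},
    {"tweet": "elasticsearch full text", "name": "john", "age": 20},
]

clauses = dict(
    must=[lambda d: "elasticsearch" in d["tweet"]],
    must_not=[lambda d: d["name"] == "mary"],
    should=[lambda d: "full text" in d["tweet"]],
    filters=[lambda d: d["age"] > 30],
)
scores = [bool_query(d, **clauses) for d in docs]
print(scores)
```

The third document is dropped by must_not and the fourth by the filter. The first two both pass, and the should clause only changes their relative score.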

Find starred emails whose body contains "business opportunity", or non-spam emails in the inbox whose body contains "business opportunity":
{
    "bool": {
        "must": { "match":   { "email": "business opportunity" }},
        "should": [
            { "match":       { "starred": true }},
            { "bool": {
                "must":      { "match": { "folder": "inbox" }},
                "must_not":  { "match": { "spam": true }}
            }}
        ],
        "minimum_should_match": 1
    }
}

If we do not want the document's date to affect the score, we can rewrite the earlier example with a filter clause:

{
    "bool": {
        "must":     { "match": { "title": "how to make millions" }},
        "must_not": { "match": { "tag":   "spam" }},
        "should": [
            { "match": { "tag": "starred" }}
        ],
        "filter": {
          "range": { "date": { "gte": "2014-01-01" }} 
        }
    }
}

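The range query has been moved from the should clause into the filter clause. This turns it into a non-scoring query, so it no longer affects the relevance ranking of documents. Because it is now non-scoring, it can also benefit from the optimizations available to filters, which improves performance.

To filter on several criteria at once, a bool query can itself be placed inside the filter clause: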

{
    "bool": {
        "must":     { "match": { "title": "how to make millions" }},
        "must_not": { "match": { "tag":   "spam" }},
        "should": [
            { "match": { "tag": "starred" }}
        ],
        "filter": {
          "bool": { 
              "must": [
                  { "range": { "date": { "gte": "2014-01-01" }}},
                  { "range": { "price": { "lte": 29.99 }}}
              ],
              "must_not": [
                  { "term": { "category": "ebooks" }}
              ]
          }
        }
    }
}

constant_score: the term query is placed inside constant_score, which turns it into a non-scoring filter. This form can be used in place of a bool query that has only a filter clause.
```
{
    "constant_score":   {
        "filter": {
            "term": { "category": "ebooks" } 
        }
    }
}
```
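Validating queries

The validate-query API can be used to check whether a query is valid. In the request below the field name and the query type are swapped, so validation fails.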

GET /gb/tweet/_validate/query?explain 
{
   "query": {
      "tweet" : {
         "match" : "really powerful"
      }
   }
}

<<

{
  "valid" :     false,
  "_shards" :   { ... },
  "explanations" : [ {
    "index" :   "gb",
    "valid" :   false,
    "error" :   "org.elasticsearch.index.query.QueryParsingException:
                 [gb] No query registered for [tweet]"
  } ]
}

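Sorting

By default ES sorts results by _score in descending order. You can sort on a different field instead; if you sort only on other fields, _score is no longer calculated.

Sorting by field values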

GET /_search
{
    "query" : {
        "bool" : {
            "filter" : { "term" : { "user_id" : 1 }}
        }
    },
    "sort": { "date": { "order": "desc" }}
}

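Multilevel sorting: results are sorted by the first criterion first. The second criterion is used only when the first sort value of two results is identical, and so on.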

GET /_search
{
    "query" : {
        "bool" : {
            "must":   { "match": { "tweet": "manage text search" }},
            "filter" : { "term" : { "user_id" : 2 }}
        }
    },
    "sort": [
        { "date":   { "order": "desc" }},
        { "_score": { "order": "desc" }}
    ]
}
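The same ordering rule can be tried on a handful of made-up hits. In this sketch the hit data is an assumption, and dates are ISO strings, which compare correctly as plain strings.

```python
hits = [
    {"_id": "a", "date": "2014-09-22", "_score": 0.3},
    {"_id": "b", "date": "2014-09-24", "_score": 0.1},
    {"_id": "c", "date": "2014-09-22", "_score": 0.9},
]

# date desc comes first; _score desc only breaks ties between equal dates.
# Both levels are descending, so one reverse sort on the tuple is enough.
ordered = sorted(hits, key=lambda h: (h["date"], h["_score"]), reverse=True)
print([h["_id"] for h in ordered])
```

Hit b comes first despite its low score because its date is the most recent; c and a share a date, so their scores decide.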


Query-string search also supports custom sorting, through the sort parameter in the query string:

GET /_search?sort=date:desc&sort=_score&q=search

Sorting on multi-value fields: for numbers or dates, you can reduce a multi-value field to a single value by using the min, max, avg, or sum sort mode.

"sort": {
    "dates": {
        "order": "asc",
        "mode":  "min"
    }
}
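What a sort mode does can be sketched as a reduction applied before sorting. The documents below are made up, and the dates field holds plain numbers (think of them as day offsets) to keep the example short.

```python
from statistics import mean

# Each sort mode reduces a list of values to the single value used for sorting.
MODES = {"min": min, "max": max, "avg": mean, "sum": sum}


def sort_key(values, mode="min"):
    """Reduce a multi-value field to one value according to the sort mode."""
    return MODES[mode](values)


docs = [
    {"_id": "a", "dates": [5, 9]},
    {"_id": "b", "dates": [3, 20]},
    {"_id": "c", "dates": [4]},
]
by_min = sorted(docs, key=lambda d: sort_key(d["dates"], "min"))
by_max = sorted(docs, key=lambda d: sort_key(d["dates"], "max"))
print([d["_id"] for d in by_min], [d["_id"] for d in by_max])
```

The same documents come out in a different order depending on the mode, which is why the mode has to be chosen explicitly for multi-value fields.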

2022-04-21 Elasticsearch study notes