Notes (Part 2)

Search API

  • URI Search
    • Pass query parameters in the URL
  • Request Body Search
    • Use the more complete, JSON-based Query Domain Specific Language (DSL) provided by Elasticsearch

Specifying the indices to search

Syntax                     Scope
/_search                   all indices in the cluster
/index1/_search            index1
/index1,index2/_search     index1 and index2
/index*/_search            all indices whose names start with "index"
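
For example (index1, index2, and kibana* are placeholder names, assuming such indices exist):

GET /index1,index2/_search?q=2012
GET /kibana*/_search?q=customer_first_name:Eddie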

URI query

  • Use "q" to specify the query string
  • "query string syntax", key-value (KV) pairs

For example:

curl -XGET "http://elasticsearch:9200/kibana_sample_data_ecommerce/_search?q=customer_first_name:Eddie" 

Here kibana_sample_data_ecommerce is the target index, and q specifies the query: search for customers named Eddie, where customer_first_name is the key (K) and Eddie is the value (V).

Request Body

For example:

curl -XGET "http://elasticsearch:9200/kibana_sample_data_ecommerce/_search" \
  -H 'Content-Type: application/json' -d '
{
  "query": {
    "match_all": {}
  }
}'

Here kibana_sample_data_ecommerce is the index name and _search is the search operation; query is the query clause, and match_all matches everything, returning all documents.

Search Response

{
  "took" : 10,              --- took: time taken, in milliseconds
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 4675,          --- total: number of documents matching the query
    "max_score": 1,
    "hits": [               --- the result set, top 10 documents by default
      {
        "_index": "kibana_sample_data_ecommerce",   --- index name
        "_type" : "_doc",
        "_id" : "CbLRW2kBi-meog",                    --- document ID
        "_score": 1,        --- relevance score
        "_source": {        --- the original document
          "category": ["Men's Clothing"],
          "currency": "EUR",
          "customer_first_name": "Eddie"
          }
      }
    ]
  }
}

URI Search - searching via a URI query

GET /movies/_search?q=2012&df=title&sort=year:desc&from=0&size=10&timeout=1s
{
  "profile":true
}
// Same as the query above; when df is not used to specify the field, use k:v after q
GET /movies/_search?q=title:2012&sort=year:desc&from=0&size=10&timeout=1s
{
  "profile":true
}
  • q specifies the query string, using Query String Syntax
  • df is the default field; when not specified, all fields are queried
  • sort for sorting / from and size for pagination
  • profile shows how the query was executed

Query String Syntax (1)

  • Field-specific vs. all-field queries
    • q=title:2012 / q=2012
  • Term vs. Phrase
    • Beautiful Mind is equivalent to Beautiful OR Mind (when combined with a field, wrap it in parentheses)
    • "Beautiful Mind" is equivalent to Beautiful AND Mind; a Phrase query also requires the terms to appear in the same order
  • Grouping and quotes
    • title:(Beautiful AND Mind)
    • title:"Beautiful Mind"

Query String Syntax (2)

  • Boolean operators
    • AND / OR / NOT, or && / || / !
      • must be uppercase
      • title:(matrix NOT reloaded)
    • Grouping
      • + means must (can also be written as %2B)
      • - means must_not
      • title:(+matrix -reloaded)
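
As a URI search this looks like the following; note that + has to be URL-encoded as %2B when it is sent in the URL:

// title must contain matrix and must not contain reloaded
GET /movies/_search?q=title:(%2Bmatrix -reloaded)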

Query String Syntax (3)

  • Range queries
    • Interval notation: [] closed interval, {} open interval
    • year:{2019 TO 2018}
    • year:[* TO 2018]
  • Arithmetic operators
    • year:>2010
    • year:(>2010&&<=2018)
    • year:(+>2010 +<=2018)
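
For example, a closed-interval range query against the movies index:

GET /movies/_search?q=year:[2010 TO 2018]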

Query String Syntax (4)

  • Wildcard queries (wildcard queries are inefficient and memory-hungry, so they are not recommended - especially with a leading wildcard)
    • ? matches one character, * matches zero or more characters
      • title:mi?d
      • title:be*
  • Regular expressions
    • title:[bt]oy
  • Fuzzy matching and proximity queries
    • title:befutifl~1
    • title:"lord rings"~2

Request Body Search

  • The query is sent to Elasticsearch in the HTTP Request Body
  • Query DSL
POST /movies,404_idx/_search?ignore_unavailable=true
{
  "profile": true,
  "_source": ["a", "b", "c"],        --- fields to return; if _source stores nothing, only the metadata of matching documents is returned; wildcards are supported
  "sort": [{"order_date": "desc"}],  --- sorting
  "from": 10,                        --- pagination
  "size": 20,
  "query": {
    "match_all": {}
  }
}
  • Sorting works best on "numeric" and "date" fields
  • For multi-valued or analyzed fields, the system picks one value to sort on, and you cannot tell which value was chosen

Script fields

GET kibana_sample_data_ecommerce/_search
{
  "script_fields": {
    "new_field":{
       "script": {
          “lang”: "painless",
          "source":"doc['order_date'].value+'_hello'"
        }
     }
  },
  "from":10,
  "size":5,
  "query":{
    "match_all":{}
   }
}

Use case: orders are in different currencies, and an exchange rate has to be applied to sort by an adjusted price or to compute a new derived value.
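
A minimal sketch of that use case, assuming the taxful_total_price field from the ecommerce sample data and an illustrative fixed rate passed in as a script parameter:

GET kibana_sample_data_ecommerce/_search
{
  "script_fields": {
    "price_in_usd": {
      "script": {
        "lang": "painless",
        "source": "doc['taxful_total_price'].value * params.rate",
        "params": { "rate": 1.1 }
      }
    }
  },
  "size": 5,
  "query": {
    "match_all": {}
  }
}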

Using query expressions - Match

// Matches documents containing Last, Christmas, or both
GET /comments/_doc/_search
{
  "query":{
    "match": {
      "comment": "Last Christmas"
    }
  }
}
// Matches documents containing both Last and Christmas
GET /comments/_doc/_search
{
  "query":{
    "match":{
      "comment": {
        "query":"Last Christmas",
        "operator": "AND"
      }
    }
  }
}

Phrase search - Match Phrase

  • With slop set to 1, the query below matches documents containing the phrase "Song Last Christmas", allowing one position of flexibility between the terms
GET /comments/_doc/_search
{
  "query":{
    "match_phrase":{
      "comment": {
        "query":"SongLast Chrismas",
        "slop": 1
      }
    }  
  }
}

Query String Query

  • Similar to URI Query
 POST users/_search
{
  "query": {
    "query_string" : {
      "default_field":"name",
      "query":"Ruan AND Yiming"
    }
  }
}

POST users/_search
{
  "query": {
    "query_string": {
      "fields":["name", "about"],
      "query" : "(Ruan AND Yiming) OR (Java AND Elasticsearch)"
    }
  }
}

Simple Query String Query

  • Similar to Query String, but it ignores invalid syntax and only supports part of the query syntax
  • AND, OR, and NOT are not supported as operators; they are treated as plain terms. Use default_operator to get the equivalent behavior
  • The default relationship between terms is OR; it can be changed with the Operator
  • Supports part of the logic:
    • + replaces AND
    • | replaces OR
    • - replaces NOT
POST users/_search
{
  "query": {
    "simple_query_string": {
      "query": "Ruan - Yiming",
      "fields": ["name"],
      "default_operator": "AND"
    }
  }
}

Mapping

  • A Mapping is similar to a schema definition in a database. It:
    • Defines the names of the fields in an index
    • Defines the data type of each field, e.g. string, number, boolean...
    • Configures the field and its inverted index (Analyzed or Not Analyzed, which Analyzer)
  • A Mapping maps a JSON document into the flat format that Lucene requires
  • A Mapping belongs to an index's Type
    • Every document belongs to a Type
    • A Type has one Mapping definition
    • Since 7.0, type information no longer needs to be specified in a Mapping definition

Field data types

  • Simple types (a sample mapping using several of them follows this list)
    • Text / Keyword
    • Date
    • Integer / Floating point
    • Boolean
    • IPv4 & IPv6
  • Complex types - objects and nested objects
    • Object type / Nested type
  • Special types
    • geo_point & geo_shape / percolator

Dynamic Mapping

  • When a document is written and the index does not exist, the index is created automatically
  • The Dynamic Mapping mechanism means Mappings do not have to be defined by hand: Elasticsearch infers the field types from the document
  • The inference is sometimes wrong, e.g. for geo location data
  • If a type is set incorrectly, some features stop working properly, e.g. Range queries

Type auto-detection

JSON type       Elasticsearch type
String          Date, if the value matches a date format;
                float or long, if the value is a numeric string (this detection is off by default);
                otherwise Text, with a keyword sub-field added
Boolean         boolean
Floating point  float
Integer         long
Object          Object
Array           determined by the type of the first non-null element
Null            ignored
  • Demo
// Index a document
PUT mapping_test/_doc/1
{
  "firstName": "Chan",
  "lastName": "Jackie",
  "loginDate": "2018-07-24T10:29:48.103Z"
}
// View the generated mapping
GET mapping_test/_mapping
// Response
{
  "mapping_test" : {
    "mappings" : {
      "properties" : {
        "firstName" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "lastName" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "loginDate" : {
          "type" : "date"
        }
      }
    }
  }
}
// Delete index
DELETE mapping_test

// Dynamic mapping infers the field types
PUT mapping_test/_doc/1
{
  "uid":"123",
  "isVip": false,
  "isAdmin": "true",
  "age":19,
  "heigh":180
}
GET mapping_test/_mapping
// Response
{
  "mapping_test" : {
    "mappings" : {
      "properties" : {
        "age" : {
          "type" : "long"
        },
        "heigh" : {
          "type" : "long"
        },
        "isAdmin" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "isVip" : {
          "type" : "boolean"
        },
        "uid" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        }
      }
    }
  }
}

Can the field type in a Mapping be changed?

  • Two situations
    • Adding a new field
      • With Dynamic set to true, the Mapping is updated as soon as a document containing the new field is written
      • With Dynamic set to false, the Mapping is not updated; data in the new field cannot be indexed, but it still appears in _source
      • With Dynamic set to Strict, the document write fails
    • Existing fields: once data has been written, the field definition can no longer be modified
      • A Lucene inverted index, once generated, cannot be changed
    • To change a field's type, you must rebuild the index with the Reindex API (see the sketch after this list)
  • Why
    • Changing a field's data type would make the already-indexed data unsearchable
    • Adding a new field has no such impact
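
A minimal sketch of the Reindex flow (old_movies and new_movies are hypothetical names; new_movies would first be created with the corrected mapping):

POST _reindex
{
  "source": { "index": "old_movies" },
  "dest":   { "index": "new_movies" }
}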

Controlling Dynamic Mappings

                     "true"   "false"   "strict"
Document indexed      YES      YES       NO
New field indexed     YES      NO        NO
Mapping updated       YES      NO        NO
PUT movies
{
  "mappings": {
    "dynamic": "false"
  }
}
  • When dynamic is set to false, documents containing new fields can still be written and indexed, but the new fields themselves are not indexed (their values only appear in _source)
  • When set to Strict, writing such a document fails outright
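
A short sketch of strict mode (index and field names are made up); the second request should be rejected with a strict_dynamic_mapping_exception:

PUT strict_demo
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "title": { "type": "text" }
    }
  }
}

// fails: "year" is not declared in the mapping
PUT strict_demo/_doc/1
{
  "title": "Avatar",
  "year": 2009
}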

Explicitly defining a Mapping

PUT movies
{
  "mappings": {
    //...
  }
}

Controlling whether a field is indexed

  • index - controls whether the field is indexed. Defaults to true. If set to false, the field cannot be searched
PUT users
{
  "mappings": {
    "properties": {
        "firstName": {
          "type": "text"
        },
        "lastName": {
          "type": "text"
        },
        "mobile": {
          "type": "text",
          "index": false                     --- 无法通过手机号进行搜索
        }
     }
  }
}

Index Options

  • The index_options setting has four levels that control what the inverted index records (see the example after this list):
    • docs - records doc id only
    • freqs - records doc id and term frequencies
    • positions - records doc id / term frequencies / term positions
    • offsets - records doc id / term frequencies / term positions / character offsets
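
A minimal mapping sketch setting this parameter (index and field names are made up; text fields default to positions):

PUT index_options_demo
{
  "mappings": {
    "properties": {
      "body": {
        "type": "text",
        "index_options": "offsets"
      }
    }
  }
}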

Null_value

  • Makes null values searchable
  • Only the Keyword type supports setting null_value
GET users/_search?q=mobile:NULL
PUT users
{
    "mappings": {
    "properties": {
        "firstName": {
          "type": "text"
        },
        "lastName": {
          "type": "text"
        },
        "mobile": {
          "type": "keyword",
          "null_value": "NULL"
        }
     }
  }
}

// Returned _source
"_source": {
  "firstName": "Ruan",
 "lastName": "Yiming",
 "mobile": null
}

The copy_to setting

  • _all was replaced by copy_to in version 7
  • Serves certain specific search needs
  • copy_to copies the field's value into a target field, achieving an effect similar to _all
  • The copy_to target field does not appear in _source
PUT users
{
    "mappings": {
    "properties": {
        "firstName": {
          "type": "text",
          "copy_to":"fullName"
        },
        "lastName": {
          "type": "text",
          "copy_to": "fullName"
        }
     }
  }
}
// At query time
GET users/_search?q=fullName:(Ruan Yiming)

Multi-field types

  • The multi-field feature
    • Exact matching on a vendor name
      • Add a keyword sub-field
    • Using different analyzers
      • Different languages
      • Pinyin-based search
      • Different analyzers can even be specified for search and for indexing
PUT products
{
  "mappings": {
    "properties": {
      "company": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "comment": {
        "type": "text",
        "fields": {
          "english_comment": {
            "type": "text",
            "analyzer": "english",
            "search_analyzer": "english"
          }
        }
      }
    }
  }
}

Exact Values vs Full Text

  • Exact values: numbers, dates, and specific strings (e.g. "Apple Store")
    • keyword in Elasticsearch
    • No special tokenization needed
  • Full Text: full, unstructured text data
    • text in Elasticsearch
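
The difference shows up in the _analyze API: the standard analyzer used for text fields splits and lowercases, while the keyword analyzer keeps the value as a single term:

POST _analyze
{
  "analyzer": "standard",
  "text": "Apple Store"
}
// tokens: "apple", "store"

POST _analyze
{
  "analyzer": "keyword",
  "text": "Apple Store"
}
// a single token: "Apple Store"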

Custom analysis

  • When the analyzers that ship with Elasticsearch are not enough, you can define a custom analyzer by combining the different components:
    • Character Filter
    • Tokenizer
    • Token Filter

Character Filters

  • Processes the text before the Tokenizer, e.g. adding, removing, or replacing characters. Multiple Character Filters can be configured. They affect the position and offset information seen by the Tokenizer
  • Some built-in Character Filters
    • HTML strip - removes HTML tags
 POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": ["html_strip"],
  "text": "hello world"
}
// Response
{
  "tokens": [
    {
      "token": "hello world",
      "start_offset": 3,
      "end_offset": 18,
      "type": "word",
      "position": 0
    }
  ]
}
  • Mapping - string replacement
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": ["- => _"]
    }
  ],
  "text": "123-456, I-test! test-990 650-555-1234"
}
// Response
{
  "tokens" : [
    {
      "token" : "123_456",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "",
      "position" : 0
    },
    {
      "token" : "I_test",
      "start_offset" : 9,
      "end_offset" : 15,
      "type" : "",
      "position" : 1
    },
    {
      "token" : "test_990",
      "start_offset" : 17,
      "end_offset" : 25,
      "type" : "",
      "position" : 2
    },
    {
      "token" : "650_555_1234",
      "start_offset" : 26,
      "end_offset" : 38,
      "type" : "",
      "position" : 3
    }
  ]
}

// Replace emoticons
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
    "type": "mapping",
    "mappings":[":) => happy", ":( => sad"]
    }
    ],
    "text": ["I am felling :)", "Feeling :( today"]
}
// Response
{
  "tokens" : [
    {
      "token" : "I",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "",
      "position" : 0
    },
    {
      "token" : "am",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "",
      "position" : 1
    },
    {
      "token" : "felling",
      "start_offset" : 5,
      "end_offset" : 12,
      "type" : "",
      "position" : 2
    },
    {
      "token" : "happy",
      "start_offset" : 13,
      "end_offset" : 15,
      "type" : "",
      "position" : 3
    },
    {
      "token" : "Feeling",
      "start_offset" : 16,
      "end_offset" : 23,
      "type" : "",
      "position" : 104
    },
    {
      "token" : "sad",
      "start_offset" : 24,
      "end_offset" : 26,
      "type" : "",
      "position" : 105
    },
    {
      "token" : "today",
      "start_offset" : 27,
      "end_offset" : 32,
      "type" : "",
      "position" : 106
    }
  ]
}

  • Pattern replace - regex-based replacement
GET _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "http://(.*)",
      "replacement": "$1"
    }
    ],
    "text": "http://www.elastic.co"
}
// Response
{
  "tokens" : [
    {
      "token" : "www.elastic.co",
      "start_offset" : 0,
      "end_offset" : 21,
      "type" : "",
      "position" : 0
    }
  ]
}

Tokenizer

  • Splits the original text into terms (tokens) according to certain rules
  • Tokenizers built into Elasticsearch
    • whitespace / standard / uax_url_email / pattern / keyword / path_hierarchy
  • You can also implement your own Tokenizer
POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/user/ymruan/a/b/c/d/e"
}

// Response
{
  "tokens" : [
    {
      "token" : "/user",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/user/ymruan",
      "start_offset" : 0,
      "end_offset" : 12,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/user/ymruan/a",
      "start_offset" : 0,
      "end_offset" : 14,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/user/ymruan/a/b",
      "start_offset" : 0,
      "end_offset" : 16,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/user/ymruan/a/b/c",
      "start_offset" : 0,
      "end_offset" : 18,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/user/ymruan/a/b/c/d",
      "start_offset" : 0,
      "end_offset" : 20,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/user/ymruan/a/b/c/d/e",
      "start_offset" : 0,
      "end_offset" : 22,
      "type" : "word",
      "position" : 0
    }
  ]
}

Token Filters

  • Add, modify, or remove the terms output by the Tokenizer
  • Built-in Token Filters
    • Lowercase / stop / synonym (adds synonyms)
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["stop"],
  "text": ["The rain in Spain falls mainly on the plain."]
}

// Response
{
  "tokens" : [
    {
      "token" : "The",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "rain",
      "start_offset" : 4,
      "end_offset" : 8,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "Spain",
      "start_offset" : 12,
      "end_offset" : 17,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "falls",
      "start_offset" : 18,
      "end_offset" : 23,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "mainly",
      "start_offset" : 24,
      "end_offset" : 30,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "plain.",
      "start_offset" : 38,
      "end_offset" : 44,
      "type" : "word",
      "position" : 8
    }
  ]
}

Setting up a Custom Analyzer

{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type": "english",
          "stem_exclusion": ["organization", "organizations"],
          "stopwords": [
            "a","an","and","are","as","at","be","but","by","for","if","in","into","is","it","of","on","or",
            "such", "that", "the", "their", "then", "there", "these", "they", "this", "to","was","will","with"
          ]
        }
      }
    }
  }
}
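
These settings take effect when supplied at index-creation time. A minimal sketch of testing them, assuming they were applied to a hypothetical index my_english_test:

POST my_english_test/_analyze
{
  "analyzer": "my_english",
  "text": "These organizations are organizing the event"
}
// "organizations" is kept unstemmed thanks to stem_exclusion, and stopwords such as "the" and "these" are dropped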

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["emoticons"],
          "tokenizer": "punctuation",
          "filter": ["lowercase", "english_stop"]
        }
      },
      "tokenizer": {
        "punctuation": {
          "type": "pattern",
          "pattern": "[.,!?]"
        }
      },
      "char_filter": {
        "emoticons": {
          "type": "mapping",
          "mappings": [
            ":) => _happy_",
            ":( => _sad_"
          ]
        }
      },
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}

// Response
{
  "acknowledged" : true,
  "shards_acknowledged" : true,
  "index" : "my_index"
}

POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "I'm a :) person, and you ?"
}
// Response
{
  "tokens" : [
    {
      "token" : "i'm a _happy_ person",
      "start_offset" : 0,
      "end_offset" : 15,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : " and you ",
      "start_offset" : 16,
      "end_offset" : 25,
      "type" : "word",
      "position" : 1
    }
  ]
}
