Notes (Part 2)

Search API

  • URI Search
    • Pass query parameters in the URL
  • Request Body Search
    • Use the more complete, JSON-based Query Domain Specific Language (DSL) provided by Elasticsearch

Specifying the indices to search

Syntax                     Scope
/_search                   all indices in the cluster
/index1/_search            index1
/index1,index2/_search     index1 and index2
/index*/_search            all indices whose names start with "index"
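
For example (index1, index2, and kibana* are placeholder names, assuming such indices exist):

GET /index1,index2/_search?q=2012
GET /kibana*/_search?q=customer_first_name:Eddie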

URI query

  • Use "q" to specify the query string
  • "query string syntax", key-value (KV) pairs

For example:

curl -XGET "http://elasticsearch:9200/kibana_sample_data_ecommerce/_search?q=customer_first_name:Eddie" 

Here kibana_sample_data_ecommerce is the target index, and q specifies the query: search for customers named Eddie, where customer_first_name is the key (K) and Eddie is the value (V).

Request Body

For example:

curl -XGET "http://elasticsearch:9200/kibana_sample_data_ecommerce/_search" \
  -H 'Content-Type: application/json' -d '
{
  "query": {
    "match_all": {}
  }
}'

Here kibana_sample_data_ecommerce is the index name and _search is the search operation; query is the query clause, and match_all matches everything, returning all documents.

Search Response

{
  "took" : 10,              --- took: time taken, in milliseconds
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 4675,          --- total: number of documents matching the query
    "max_score": 1,
    "hits": [               --- the result set, top 10 documents by default
      {
        "_index": "kibana_sample_data_ecommerce",   --- index name
        "_type" : "_doc",
        "_id" : "CbLRW2kBi-meog",                    --- document ID
        "_score": 1,        --- relevance score
        "_source": {        --- the original document
          "category": ["Men's Clothing"],
          "currency": "EUR",
          "customer_first_name": "Eddie"
          }
      }
    ]
  }
}

URI Search - searching via a URI query

GET /movies/_search?q=2012&df=title&sort=year:desc&from=0&size=10&timeout=1s
{
  "profile":true
}
// Same as the query above; when df is not used to specify the field, use k:v after q
GET /movies/_search?q=title:2012&sort=year:desc&from=0&size=10&timeout=1s
{
  "profile":true
}
  • q specifies the query string, using Query String Syntax
  • df is the default field; when not specified, all fields are queried
  • sort for sorting / from and size for pagination
  • profile shows how the query was executed

Query String Syntax (1)

  • Field-specific vs. all-field queries
    • q=title:2012 / q=2012
  • Term vs. Phrase
    • Beautiful Mind is equivalent to Beautiful OR Mind (when combined with a field, wrap it in parentheses)
    • "Beautiful Mind" is equivalent to Beautiful AND Mind; a Phrase query also requires the terms to appear in the same order
  • Grouping and quotes
    • title:(Beautiful AND Mind)
    • title:"Beautiful Mind"

Query String Syntax (2)

  • Boolean operators
    • AND / OR / NOT, or && / || / !
      • must be uppercase
      • title:(matrix NOT reloaded)
    • Grouping
      • + means must (can also be written as %2B)
      • - means must_not
      • title:(+matrix -reloaded)
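
As a URI search this looks like the following; note that + has to be URL-encoded as %2B when it is sent in the URL:

// title must contain matrix and must not contain reloaded
GET /movies/_search?q=title:(%2Bmatrix -reloaded)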

Query String Syntax (3)

  • Range queries
    • Interval notation: [] closed interval, {} open interval
    • year:{2019 TO 2018}
    • year:[* TO 2018]
  • Arithmetic operators
    • year:>2010
    • year:(>2010&&<=2018)
    • year:(+>2010 +<=2018)
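
For example, a closed-interval range query against the movies index:

GET /movies/_search?q=year:[2010 TO 2018]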

Query String Syntax (4)

  • Wildcard queries (wildcard queries are inefficient and memory-hungry, so they are not recommended - especially with a leading wildcard)
    • ? matches one character, * matches zero or more characters
      • title:mi?d
      • title:be*
  • Regular expressions
    • title:[bt]oy
  • Fuzzy matching and proximity queries
    • title:befutifl~1
    • title:"lord rings"~2

Request Body Search

  • The query is sent to Elasticsearch in the HTTP Request Body
  • Query DSL
POST /movies,404_idx/_search?ignore_unavailable=true
{
  "profile": true,
  "_source": ["a", "b", "c"],        --- fields to return; if _source stores nothing, only the metadata of matching documents is returned; wildcards are supported
  "sort": [{"order_date": "desc"}],  --- sorting
  "from": 10,                        --- pagination
  "size": 20,
  "query": {
    "match_all": {}
  }
}
  • Sorting works best on "numeric" and "date" fields
  • For multi-valued or analyzed fields, the system picks one value to sort on, and you cannot tell which value was chosen

Script fields

GET kibana_sample_data_ecommerce/_search
{
  "script_fields": {
    "new_field":{
       "script": {
          “lang”: "painless",
          "source":"doc['order_date'].value+'_hello'"
        }
     }
  },
  "from":10,
  "size":5,
  "query":{
    "match_all":{}
   }
}

Use case: orders are in different currencies, and an exchange rate has to be applied to sort by an adjusted price or to compute a new derived value.
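
A minimal sketch of that use case, assuming the taxful_total_price field from the ecommerce sample data and an illustrative fixed rate passed in as a script parameter:

GET kibana_sample_data_ecommerce/_search
{
  "script_fields": {
    "price_in_usd": {
      "script": {
        "lang": "painless",
        "source": "doc['taxful_total_price'].value * params.rate",
        "params": { "rate": 1.1 }
      }
    }
  },
  "size": 5,
  "query": {
    "match_all": {}
  }
}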

Using query expressions - Match

// Matches documents containing Last, Christmas, or both
GET /comments/_doc/_search
{
  "query":{
    "match": {
      "comment": "Last Christmas"
    }
  }
}
// Matches documents containing both Last and Christmas
GET /comments/_doc/_search
{
  "query":{
    "match":{
      "comment": {
        "query":"Last Christmas",
        "operator": "AND"
      }
    }
  }
}

Phrase search - Match Phrase

  • With slop set to 1, the query below matches documents containing the phrase "Song Last Christmas", allowing one position of flexibility between the terms
GET /comments/_doc/_search
{
  "query":{
    "match_phrase":{
      "comment": {
        "query":"SongLast Chrismas",
        "slop": 1
      }
    }  
  }
}

Query String Query

  • Similar to URI Query
 POST users/_search
{
  "query": {
    "query_string" : {
      "default_field":"name",
      "query":"Ruan AND Yiming"
    }
  }
}

POST users/_search
{
  "query": {
    "query_string": {
      "fields":["name", "about"],
      "query" : "(Ruan AND Yiming) OR (Java AND Elasticsearch)"
    }
  }
}

Simple Query String Query

  • Similar to Query String, but it ignores invalid syntax and only supports part of the query syntax
  • AND, OR, and NOT are not supported as operators; they are treated as plain terms. Use default_operator to get the equivalent behavior
  • The default relationship between terms is OR; it can be changed with the Operator
  • Supports part of the logic:
    • + replaces AND
    • | replaces OR
    • - replaces NOT
POST users/_search
{
  "query": {
    "simple_query_string": {
      "query": "Ruan - Yiming",
      "fields": ["name"],
      "default_operator": "AND"
    }
  }
}

Mapping

  • A Mapping is similar to a schema definition in a database. It:
    • Defines the names of the fields in an index
    • Defines the data type of each field, e.g. string, number, boolean...
    • Configures the field and its inverted index (Analyzed or Not Analyzed, which Analyzer)
  • A Mapping maps a JSON document into the flat format that Lucene requires
  • A Mapping belongs to an index's Type
    • Every document belongs to a Type
    • A Type has one Mapping definition
    • Since 7.0, type information no longer needs to be specified in a Mapping definition

Field data types

  • Simple types (a sample mapping using several of them follows this list)
    • Text / Keyword
    • Date
    • Integer / Floating point
    • Boolean
    • IPv4 & IPv6
  • Complex types - objects and nested objects
    • Object type / Nested type
  • Special types
    • geo_point & geo_shape / percolator

Dynamic Mapping

  • When a document is written and the index does not exist, the index is created automatically
  • The Dynamic Mapping mechanism means Mappings do not have to be defined by hand: Elasticsearch infers the field types from the document
  • The inference is sometimes wrong, e.g. for geo location data
  • If a type is set incorrectly, some features stop working properly, e.g. Range queries

Type auto-detection

JSON type       Elasticsearch type
String          Date, if the value matches a date format;
                float or long, if the value is a numeric string (this detection is off by default);
                otherwise Text, with a keyword sub-field added
Boolean         boolean
Floating point  float
Integer         long
Object          Object
Array           determined by the type of the first non-null element
Null            ignored
  • Demo
// Index a document
PUT mapping_test/_doc/1
{
  "firstName": "Chan",
  "lastName": "Jackie",
  "loginDate": "2018-07-24T10:29:48.103Z"
}
// View the generated mapping
GET mapping_test/_mapping
// Response
{
  "mapping_test" : {
    "mappings" : {
      "properties" : {
        "firstName" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "lastName" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "loginDate" : {
          "type" : "date"
        }
      }
    }
  }
}
// Delete index
DELETE mapping_test

// Dynamic mapping infers the field types
PUT mapping_test/_doc/1
{
  "uid":"123",
  "isVip": false,
  "isAdmin": "true",
  "age":19,
  "heigh":180
}
GET mapping_test/_mapping
// Response
{
  "mapping_test" : {
    "mappings" : {
      "properties" : {
        "age" : {
          "type" : "long"
        },
        "heigh" : {
          "type" : "long"
        },
        "isAdmin" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "isVip" : {
          "type" : "boolean"
        },
        "uid" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        }
      }
    }
  }
}

Can the field type in a Mapping be changed?

  • Two situations
    • Adding a new field
      • With Dynamic set to true, the Mapping is updated as soon as a document containing the new field is written
      • With Dynamic set to false, the Mapping is not updated; data in the new field cannot be indexed, but it still appears in _source
      • With Dynamic set to Strict, the document write fails
    • Existing fields: once data has been written, the field definition can no longer be modified
      • A Lucene inverted index, once generated, cannot be changed
    • To change a field's type, you must rebuild the index with the Reindex API (see the sketch after this list)
  • Why
    • Changing a field's data type would make the already-indexed data unsearchable
    • Adding a new field has no such impact
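
A minimal sketch of the Reindex flow (old_movies and new_movies are hypothetical names; new_movies would first be created with the corrected mapping):

POST _reindex
{
  "source": { "index": "old_movies" },
  "dest":   { "index": "new_movies" }
}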

Controlling Dynamic Mappings

                     "true"   "false"   "strict"
Document indexed      YES      YES       NO
New field indexed     YES      NO        NO
Mapping updated       YES      NO        NO
PUT movies
{
  "mappings": {
    "dynamic": "false"
  }
}
  • When dynamic is set to false, documents containing new fields can still be written and indexed, but the new fields themselves are not indexed (their values only appear in _source)
  • When set to Strict, writing such a document fails outright
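
A short sketch of strict mode (index and field names are made up); the second request should be rejected with a strict_dynamic_mapping_exception:

PUT strict_demo
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "title": { "type": "text" }
    }
  }
}

// fails: "year" is not declared in the mapping
PUT strict_demo/_doc/1
{
  "title": "Avatar",
  "year": 2009
}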

Explicitly defining a Mapping

PUT movies
{
  "mappings": {
    //...
  }
}

Controlling whether a field is indexed

  • index - controls whether the field is indexed. Defaults to true. If set to false, the field cannot be searched
PUT users
{
  "mappings": {
    "properties": {
        "firstName": {
          "type": "text"
        },
        "lastName": {
          "type": "text"
        },
        "mobile": {
          "type": "text",
          "index": false                     --- 无法通过手机号进行搜索
        }
     }
  }
}

Index Options

  • The index_options setting has four levels that control what the inverted index records (see the example after this list):
    • docs - records doc id only
    • freqs - records doc id and term frequencies
    • positions - records doc id / term frequencies / term positions
    • offsets - records doc id / term frequencies / term positions / character offsets
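
A minimal mapping sketch setting this parameter (index and field names are made up; text fields default to positions):

PUT index_options_demo
{
  "mappings": {
    "properties": {
      "body": {
        "type": "text",
        "index_options": "offsets"
      }
    }
  }
}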

Null_value

  • Makes null values searchable
  • Only the Keyword type supports setting null_value
GET users/_search?q=mobile:NULL
PUT users
{
    "mappings": {
    "properties": {
        "firstName": {
          "type": "text"
        },
        "lastName": {
          "type": "text"
        },
        "mobile": {
          "type": "keyword",
          "null_value": "NULL"
        }
     }
  }
}

// Returned _source
"_source": {
  "firstName": "Ruan",
 "lastName": "Yiming",
 "mobile": null
}

The copy_to setting

  • _all was replaced by copy_to in version 7
  • Serves certain specific search needs
  • copy_to copies the field's value into a target field, achieving an effect similar to _all
  • The copy_to target field does not appear in _source
PUT users
{
    "mappings": {
    "properties": {
        "firstName": {
          "type": "text",
          "copy_to":"fullName"
        },
        "lastName": {
          "type": "text",
          "copy_to": "fullName"
        }
     }
  }
}
// At query time
GET users/_search?q=fullName:(Ruan Yiming)

Multi-field types

  • The multi-field feature
    • Exact matching on a vendor name
      • Add a keyword sub-field
    • Using different analyzers
      • Different languages
      • Pinyin-based search
      • Different analyzers can even be specified for search and for indexing
PUT products
{
  "mappings": {
    "properties": {
      "company": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "comment": {
        "type": "text",
        "fields": {
          "english_comment": {
            "type": "text",
            "analyzer": "english",
            "search_analyzer": "english"
          }
        }
      }
    }
  }
}

Exact Values vs Full Text

  • Exact values: numbers, dates, and specific strings (e.g. "Apple Store")
    • keyword in Elasticsearch
    • No special tokenization needed
  • Full Text: full, unstructured text data
    • text in Elasticsearch
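
The difference shows up in the _analyze API: the standard analyzer used for text fields splits and lowercases, while the keyword analyzer keeps the value as a single term:

POST _analyze
{
  "analyzer": "standard",
  "text": "Apple Store"
}
// tokens: "apple", "store"

POST _analyze
{
  "analyzer": "keyword",
  "text": "Apple Store"
}
// a single token: "Apple Store"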

Custom analysis

  • When the analyzers that ship with Elasticsearch are not enough, you can define a custom analyzer by combining the different components:
    • Character Filter
    • Tokenizer
    • Token Filter

Character Filters

  • Processes the text before the Tokenizer, e.g. adding, removing, or replacing characters. Multiple Character Filters can be configured. They affect the position and offset information seen by the Tokenizer
  • Some built-in Character Filters
    • HTML strip - removes HTML tags
 POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": ["html_strip"],
  "text": "hello world"
}
// Response
{
  "tokens": [
    {
      "token": "hello world",
      "start_offset": 3,
      "end_offset": 18,
      "type": "word",
      "position": 0
    }
  ]
}
  • Mapping - string replacement
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": ["- => _"]
    }
  ],
  "text": "123-456, I-test! test-990 650-555-1234"
}
// Response
{
  "tokens" : [
    {
      "token" : "123_456",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "",
      "position" : 0
    },
    {
      "token" : "I_test",
      "start_offset" : 9,
      "end_offset" : 15,
      "type" : "",
      "position" : 1
    },
    {
      "token" : "test_990",
      "start_offset" : 17,
      "end_offset" : 25,
      "type" : "",
      "position" : 2
    },
    {
      "token" : "650_555_1234",
      "start_offset" : 26,
      "end_offset" : 38,
      "type" : "",
      "position" : 3
    }
  ]
}

// Replace emoticons
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
    "type": "mapping",
    "mappings":[":) => happy", ":( => sad"]
    }
    ],
    "text": ["I am felling :)", "Feeling :( today"]
}
// Response
{
  "tokens" : [
    {
      "token" : "I",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "",
      "position" : 0
    },
    {
      "token" : "am",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "",
      "position" : 1
    },
    {
      "token" : "felling",
      "start_offset" : 5,
      "end_offset" : 12,
      "type" : "",
      "position" : 2
    },
    {
      "token" : "happy",
      "start_offset" : 13,
      "end_offset" : 15,
      "type" : "",
      "position" : 3
    },
    {
      "token" : "Feeling",
      "start_offset" : 16,
      "end_offset" : 23,
      "type" : "",
      "position" : 104
    },
    {
      "token" : "sad",
      "start_offset" : 24,
      "end_offset" : 26,
      "type" : "",
      "position" : 105
    },
    {
      "token" : "today",
      "start_offset" : 27,
      "end_offset" : 32,
      "type" : "",
      "position" : 106
    }
  ]
}

  • Pattern replace - regex-based replacement
GET _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "http://(.*)",
      "replacement": "$1"
    }
    ],
    "text": "http://www.elastic.co"
}
// Response
{
  "tokens" : [
    {
      "token" : "www.elastic.co",
      "start_offset" : 0,
      "end_offset" : 21,
      "type" : "",
      "position" : 0
    }
  ]
}

Tokenizer

  • Splits the original text into terms (tokens) according to certain rules
  • Tokenizers built into Elasticsearch
    • whitespace / standard / uax_url_email / pattern / keyword / path_hierarchy
  • You can also implement your own Tokenizer
POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/user/ymruan/a/b/c/d/e"
}

// Response
{
  "tokens" : [
    {
      "token" : "/user",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/user/ymruan",
      "start_offset" : 0,
      "end_offset" : 12,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/user/ymruan/a",
      "start_offset" : 0,
      "end_offset" : 14,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/user/ymruan/a/b",
      "start_offset" : 0,
      "end_offset" : 16,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/user/ymruan/a/b/c",
      "start_offset" : 0,
      "end_offset" : 18,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/user/ymruan/a/b/c/d",
      "start_offset" : 0,
      "end_offset" : 20,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/user/ymruan/a/b/c/d/e",
      "start_offset" : 0,
      "end_offset" : 22,
      "type" : "word",
      "position" : 0
    }
  ]
}

Token Filters

  • Add, modify, or remove the terms output by the Tokenizer
  • Built-in Token Filters
    • Lowercase / stop / synonym (adds synonyms)
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["stop"],
  "text": ["The rain in Spain falls mainly on the plain."]
}

// Response
{
  "tokens" : [
    {
      "token" : "The",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "rain",
      "start_offset" : 4,
      "end_offset" : 8,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "Spain",
      "start_offset" : 12,
      "end_offset" : 17,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "falls",
      "start_offset" : 18,
      "end_offset" : 23,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "mainly",
      "start_offset" : 24,
      "end_offset" : 30,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "plain.",
      "start_offset" : 38,
      "end_offset" : 44,
      "type" : "word",
      "position" : 8
    }
  ]
}

Setting up a Custom Analyzer

{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type": "english",
          "stem_exclusion": ["organization", "organizations"],
          "stopwords": [
            "a","an","and","are","as","at","be","but","by","for","if","in","into","is","it","of","on","or",
            "such", "that", "the", "their", "then", "there", "these", "they", "this", "to","was","will","with"
          ]
        }
      }
    }
  }
}
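
These settings take effect when supplied at index-creation time. A minimal sketch of testing them, assuming they were applied to a hypothetical index my_english_test:

POST my_english_test/_analyze
{
  "analyzer": "my_english",
  "text": "These organizations are organizing the event"
}
// "organizations" is kept unstemmed thanks to stem_exclusion, and stopwords such as "the" and "these" are dropped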

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["emoticons"],
          "tokenizer": "punctuation",
          "filter": ["lowercase", "english_stop"]
        }
      },
      "tokenizer": {
        "punctuation": {
          "type": "pattern",
          "pattern": "[.,!?]"
        }
      },
      "char_filter": {
        "emoticons": {
          "type": "mapping",
          "mappings": [
            ":) => _happy_",
            ":( => _sad_"
          ]
        }
      },
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}

// Response
{
  "acknowledged" : true,
  "shards_acknowledged" : true,
  "index" : "my_index"
}

POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "I'm a :) person, and you ?"
}
// Response
{
  "tokens" : [
    {
      "token" : "i'm a _happy_ person",
      "start_offset" : 0,
      "end_offset" : 15,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : " and you ",
      "start_offset" : 16,
      "end_offset" : 25,
      "type" : "word",
      "position" : 1
    }
  ]
}
