Elasticsearch Concepts

Elastic Stack

  • Beats: data collection
  • Logstash: data transformation
  • Elasticsearch: storage / indexing / aggregation
  • Kibana: data visualization

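Node Roles

Node role                  Setting (default value)
Master-eligible            node.master=true
Data                       node.data=true
Machine Learning           node.ml=true && xpack.ml.enabled=true
Ingest (pre-processing)    node.ingest=true
Coordinating only          every setting above set to false (the xpack setting aside)

Every node has the coordinating function, and it cannot be disabled. A coordinating node forwards each request to the nodes that hold the data and gathers their partial results into one response.

  • Every node knows about the other nodes in the cluster and can forward a request to the appropriate node;
  • With the default configuration, every node handles both HTTP (client) traffic and transport (node-to-node) traffic;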

A Data Node holds:

  • Shard data
  • Cluster metadata + index metadata

A Master-Eligible Node holds:

  • Cluster metadata + index metadata

https://www.elastic.co/guide/en/elasticsearch/reference/7.6/modules-node.html

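Shards

Primary Shard
  • Purpose: horizontal scaling. The primary shards of an index are distributed across multiple machines.
  • The number of primary shards is fixed when the index is created and cannot be changed afterwards.

Replica Shard
  • Purpose: high availability. A replica is a copy of a primary shard.
  • The number of replicas can be raised or lowered at any time to tune availability and read performance.

  • Each shard is one Lucene instance;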
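
One way to see why the primary shard count is fixed: in simplified form, Elasticsearch chooses the shard for a document by hashing its routing value (the document ID by default) modulo the number of primary shards. The Python sketch below models that rule. `shard_for` is a hypothetical helper and `zlib.crc32` is only a stand-in for the real hash function (Murmur3), so the shard numbers are illustrative.

```python
import zlib

def shard_for(doc_id: str, number_of_primary_shards: int) -> int:
    """Simplified routing model: hash(routing value) % number_of_primary_shards.

    Assumption: crc32 stands in for the hash Elasticsearch really uses.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % number_of_primary_shards

doc_ids = [str(i) for i in range(1000)]
with_5 = {d: shard_for(d, 5) for d in doc_ids}
with_6 = {d: shard_for(d, 6) for d in doc_ids}

# Documents that would be looked up on a different shard if the primary
# count changed from 5 to 6 without moving any data.
moved = [d for d in doc_ids if with_5[d] != with_6[d]]
print(f"{len(moved)} of {len(doc_ids)} documents would map to a different shard")
```

Most documents land on a different shard number once the divisor changes, which is why existing data would have to be reindexed. Replicas do not take part in this calculation, so their number can change freely.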

Status APIs

# Cluster health
GET /_cluster/health

# List nodes
GET /_cat/nodes?v

# List shards
GET /_cat/shards?v

# List indices with summary information
GET /_cat/indices?v

# Summary of a single index
GET /_cat/indices/movies?v

# Settings and mappings of an index
GET /movies

CRUD

  • To let ES generate the document ID automatically, use POST (POST /users/_doc); PUT requires an explicit ID.
# get: read by ID
GET /users/_doc/123
GET /users/_doc/234

# create: create only (fails if the ID already exists)
POST /users/_create/234
{
  "firstName": "BD",
  "lastName": "C",
  "tags": ["boy", "engineer"]
}

# index: create, or replace the whole document if the ID already exists
PUT /users/_doc/123
{
  "firstName": "BD",
  "lastName": "C",
  "tags": ["boy", "engineer"]
}

# update: partial update (the ID must already exist)
POST /users/_update/123
{
  "doc": {
    "firstName": "Focus"
  }
}

# delete: delete by ID
DELETE /users/_doc/123
DELETE /users/_doc/234

# Bulk operations (a failed action does not stop the remaining ones)
POST /_bulk
{"delete": {"_index": "users", "_id": 123}}
{"create": {"_index": "users", "_id": 123}}
{"tags": ["boy", "engineer"]}
{"index": {"_index": "users", "_id": 123}}
{"firstName": "bulk FN", "lastName": "bulk LN"}
{"update": {"_index": "users", "_id": 123}}
{"doc": {"tags": ["PHP", "Ruby", "Java"]}}
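
The body of a _bulk request is newline-delimited JSON: one action line, followed by a source line for every action except delete, and a newline after the last line. As a rough illustration, the Python sketch below assembles such a body from a list of actions. `bulk_body` is a hypothetical helper, not part of any client library.

```python
import json

def bulk_body(actions):
    """Serialize (action, metadata, source) triples into a _bulk request body.

    `source` is None for actions that carry no payload line (delete).
    """
    lines = []
    for action, meta, source in actions:
        lines.append(json.dumps({action: meta}))
        if source is not None:
            lines.append(json.dumps(source))
    # The body must end with a newline, including after the last line.
    return "\n".join(lines) + "\n"

body = bulk_body([
    ("delete", {"_index": "users", "_id": "123"}, None),
    ("create", {"_index": "users", "_id": "123"}, {"tags": ["boy", "engineer"]}),
    ("update", {"_index": "users", "_id": "123"}, {"doc": {"tags": ["PHP", "Ruby", "Java"]}}),
])
print(body, end="")
```

When such a body is sent over HTTP, the Content-Type header should be application/x-ndjson.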


# Multi-get: read several documents in one request
GET /_mget
{
  "docs": [
    {
      "_index": "users",
      "_id": 123
    },
    {
      "_index": "users",
      "_id": 234
    }
  ]
}

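Inverted Index

         Forward index                    Inverted index
key      id: 123                          term: Ming
value    name: Xiao Ming Ming, age: 19    id: 123, term_frequency: 2, position: [1, 2], offset: [[5, 9], [10, 14]]

position is the ordinal of the token within the field (Xiao is 0); offset is the token's [start, end) character range.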
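
To make the structure concrete, here is a minimal Python sketch that builds an inverted index with term frequency, positions, and offsets. It assumes tokens are plain whitespace-separated words with no normalization, and `build_inverted_index` is a hypothetical helper. Positions count tokens; offsets count characters.

```python
import re
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to {doc_id: {"term_frequency", "positions", "offsets"}}.

    Assumption: tokens are runs of non-whitespace characters, kept as-is
    (no lowercasing or other normalization).
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, match in enumerate(re.finditer(r"\S+", text)):
            posting = index[match.group()].setdefault(
                doc_id, {"term_frequency": 0, "positions": [], "offsets": []}
            )
            posting["term_frequency"] += 1
            posting["positions"].append(position)                    # token ordinal
            posting["offsets"].append([match.start(), match.end()])  # character range
    return dict(index)

index = build_inverted_index({123: "Xiao Ming Ming"})
print(index["Ming"][123])
# {'term_frequency': 2, 'positions': [1, 2], 'offsets': [[5, 9], [10, 14]]}
```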

Analyzer

Character Filter => Tokenizer => Token Filter

Search

Path                      Scope
/_search                  all indices
/index1/_search           index1
/index1,index2/_search    index1 and index2
/index*/_search           indices whose names start with index

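URI Search

# Search all fields of all indices for Jack
GET /_search?q=Jack

# Search the title field of all indices for Jack
GET /_search?q=title:Jack

# Search all fields of the users index for Jack
GET /users/_search?q=Jack

# Search the firstName field of the users index for Jack
GET /users/_search?q=firstName:Jack

# Fuzzy query across all indices (edit distance 1)
GET /_search?q=ExhalX~1

# Match documents that contain Waiting or Exhale
GET /_search?q=Waiting Exhale

# Match documents that contain both Waiting and Exhale, with Exhale directly after Waiting
# Only the term positions must be adjacent; the amount of whitespace between them does not matter
GET /_search?q="Waiting Exhale"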

Request Body Search

# Match all documents
GET /movies/_search
{
  "query": {
    "match_all": {}
  }
}
# `_source` selects the fields to return;
# `from` and `size` paginate the results;
# `script_fields` computes extra fields with a script
GET /movies/_search
{
  "_source": [
    "id",
    "title"
  ],
  "from": 0,
  "size": 5,
  "query": {
    "match_all": {}
  },
  "script_fields": {
    "beautiful_title": {
      "script": {
        "lang": "painless",
        "source": "'' + doc['title.keyword'].value + ''"
      }
    }
  }
}

# Phrase match
# The text is first analyzed into terms (their positions are recorded)
# Every term must be found, in the same relative positions (whitespace is ignored)
GET /movies/_search
{
  "query": {
    "match_phrase": {
      "title": "Waiting to Exhale"
    }
  }
}
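# slop relaxes the strict ordering of the terms
# slop 2: two adjacent words may be swapped and still match
# slop 3: two words may be swapped even with one word between them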
GET /movies/_search
{
  "query": {
    "match_phrase": {
      "title": {
        "query": "Exhale Waiting",
        "slop": 3,
        "analyzer": "simple"
      }
    }
  }
}
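
The slop values in the comments above can be reproduced with a small model. The Python sketch below is a simplification, not the actual scoring code: it assumes every query term occurs exactly once in the document and measures the spread of (document position - query position) across the terms. `required_slop` is a hypothetical helper.

```python
def required_slop(doc_terms, query_terms):
    """Smallest slop at which the phrase matches, in a simplified model.

    Assumptions: each query term occurs exactly once in the document, and the
    cost is the spread of (document position - query position) over the terms.
    """
    shifts = [doc_terms.index(term) - query_pos
              for query_pos, term in enumerate(query_terms)]
    return max(shifts) - min(shifts)

title = ["waiting", "to", "exhale"]
print(required_slop(title, ["waiting", "to", "exhale"]))            # 0: exact phrase
print(required_slop(["waiting", "exhale"], ["exhale", "waiting"]))  # 2: adjacent swap
print(required_slop(title, ["exhale", "waiting"]))                  # 3: swap across one word
```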


Information Retrieval

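  • precision: return as few irrelevant documents as possible (relevant documents returned / all documents returned)
  • recall: return as many of the relevant documents as possible (relevant documents returned / all relevant documents)
  • ranking: order the results by relevance

true or false, positive or negative

  • true positive: relevant, and returned
  • false positive: not relevant, but returned
  • true negative: not relevant, and not returned
  • false negative: relevant, but not returned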
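
As a worked example of the two ratios, the Python sketch below computes precision and recall from a set of returned document IDs and a set of relevant ones. The IDs are made up for illustration and `precision_recall` is a hypothetical helper.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a result set against the relevant set."""
    returned, relevant = set(returned), set(relevant)
    true_positives = len(returned & relevant)   # relevant and returned
    precision = true_positives / len(returned) if returned else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# 4 documents returned, 2 of them relevant; 5 relevant documents exist in total.
precision, recall = precision_recall(returned=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7])
print(precision, recall)  # 0.5 0.4
```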

Mapping

-- Simple types:

  • Text / Keyword
  • Date
  • Integer / Floating
  • Boolean
  • IPv4 / IPv6

-- Complex types:

  • Object
  • Nested

-- Special types:

  • Geo point
  • Geo shape
  • Percolator
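Effect of the mapping's dynamic setting on a document that contains new fields:

dynamic           Document indexed    New field searchable    Mapping updated
true (default)    yes                 yes                     yes
false             yes                 no                      no
strict            no                  no                      no

  • The mapping of an existing field cannot be changed (except by reindexing);
  • With dynamic set to false, new fields are kept in _source, but the mapping is not updated, so they cannot be searched;
  • Whether a field can be searched depends on whether the mapping was updated with it;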
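
The three dynamic modes can be summarized in a few lines of Python. This is a toy model under stated assumptions: the mapping is reduced to a set of known field names, and `index_document` is a hypothetical helper, not Elasticsearch code.

```python
def index_document(mapping, doc, dynamic):
    """Return (accepted, updated_mapping) for a document under a dynamic mode.

    Simplified model: the mapping is just a set of known field names.
    """
    new_fields = set(doc) - mapping
    if dynamic == "strict" and new_fields:
        return False, mapping               # document rejected with an error
    if dynamic == "true":
        return True, mapping | new_fields   # new fields are mapped and searchable
    return True, mapping                    # "false": new fields live in _source only

mapping = {"name"}
doc = {"name": "Xiao Gang", "tags": ["Ruby", "Java"]}
for mode in ("true", "false", "strict"):
    accepted, updated = index_document(mapping, doc, mode)
    print(mode, accepted, "tags" in updated)
```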
DELETE /demo

PUT /demo/_doc/1
{
  "name": "Xiao Ming"
}

GET /demo/_doc/1

GET /demo/_mapping

# Change the dynamic setting of the mapping
POST /demo/_mapping
{
  "dynamic": "strict"
}

GET /demo/_mapping

# Try to add a new field (rejected with an error)
POST /demo/_update/1
{
  "doc": {
    "formats": [
      "json",
      "xml"
    ]
  }
}

# Change the dynamic setting of the mapping
POST /demo/_mapping
{
  "dynamic": "false"
}

GET /demo/_mapping

# Try to add a new field
# The new field is stored in _source; the mapping is not changed
POST /demo/_update/1
{
  "doc": {
    "name": "Xiao Gang",
    "tags": [
      "Ruby",
      "Java"
    ]
  }
}

GET /demo/_doc/1

# Fields that exist in the mapping can be searched
GET /demo/_search
{
  "query": {
    "match": {
      "name": "xiao"
    }
  }
}

# The new field cannot be searched (it does not exist in the mapping)
GET /demo/_search
{
  "query": {
    "match": {
      "tags": "ruby"
    }
  }
}

Mapping Definition

"index": false means no inverted index is built for the field, so it cannot be searched.
An index with an explicit mapping can only be created with PUT /index_name {"mappings": {}}.

DELETE /demo

PUT /demo
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text"
      },
      "mobile": {
        "type": "text",
        "index": false
      }
    }
  }
}

GET /demo/_mapping

POST /demo/_doc/1
{
  "name": "Xiao Ming",
  "mobile": "021-1234567"
}

GET /demo/_search
{
  "query": {
    "match": {
      "name": "Xiao"
    }
  }
}

# Cannot search on field [mobile] since it is not indexed.
GET /demo/_search
{
  "query": {
    "match": {
      "mobile": "021"
    }
  }
}

Index Options Levels

Level        Contents
docs         doc id
freqs        doc id, term frequencies
positions    doc id, term frequencies, term positions
offsets      doc id, term frequencies, term positions, character offsets
  • text fields default to positions; all other field types default to docs

NULL Value

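Elasticsearch cannot index or search a null value. By default null, [] and [null] are all treated as "no value".
null_value replaces an explicit null with a stand-in value at index time.
_source still shows the original value; the stand-in only exists in the index.
The stand-in must have the same data type as the field.
text fields do not accept this parameter; keyword fields do.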
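
The split between what is stored in _source and what is indexed can be pictured with a short Python sketch. It is a toy model: `indexed_value` is a hypothetical helper, and the field names mirror the demo mapping that follows.

```python
def indexed_value(source_value, null_value=None):
    """Value that goes into the index for one field (simplified model).

    Only an explicit null is replaced; _source keeps whatever was sent.
    """
    if source_value is None and null_value is not None:
        return null_value
    return source_value

source = {"name": None, "age": None}   # the document as sent, kept in _source
indexed = {
    "name.raw": indexed_value(source["name"], "NULLVALUE"),  # keyword sub-field
    "age": indexed_value(source["age"], 18),
}
print(source)   # {'name': None, 'age': None}
print(indexed)  # {'name.raw': 'NULLVALUE', 'age': 18}
```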

DELETE /demo

PUT /demo
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "fields": {
          "raw": {
            "type": "keyword",
            "null_value": "NULLVALUE"
          }
        }
      },
      "age": {
        "type": "integer",
        "null_value": 18
      }
    }
  }
}

GET /demo/_mapping

POST /demo/_doc/1
{
  "name": null,
  "age": null
}

GET /demo/_doc/1

GET /demo/_search
{
  "query": {
    "match": {
      "name.raw": "NULLVALUE"
    }
  }
}

copy_to

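_all is deprecated in recent versions; copy_to provides similar functionality.
The copied values do not appear under the target field in _source.
The copy_to target can also be given a value of its own.
When a source field is updated, the indexed content of the target follows.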
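
As a rough model of these rules, the Python sketch below computes the terms indexed for each field and feeds the copy_to sources into the target. Lowercasing and whitespace splitting stand in for the analyzer, and `index_terms` is a hypothetical helper.

```python
def index_terms(doc, copy_to):
    """Terms indexed per field; copy_to sources also feed the target field.

    Assumption: lowercasing and whitespace splitting stand in for the analyzer.
    """
    indexed = {field: str(value).lower().split() for field, value in doc.items()}
    for source_field, target in copy_to.items():
        if source_field in doc:
            indexed.setdefault(target, []).extend(indexed[source_field])
    return indexed

copy_to = {"first": "full", "second": "full"}
doc = {"full": "hi", "first": "HELLO", "second": "WORLD"}
print(index_terms(doc, copy_to)["full"])  # ['hi', 'hello', 'world']
print(doc["full"])                        # hi (the document itself is untouched)

# After the source fields change, the target's indexed terms follow.
updated = {"full": "hi", "first": "ruby", "second": "java"}
print(index_terms(updated, copy_to)["full"])  # ['hi', 'ruby', 'java']
```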

DELETE /demo

PUT /demo
{
  "mappings": {
    "properties": {
      "first": {
        "type": "text",
        "copy_to": "full"
      },
      "second": {
        "type": "text",
        "copy_to": "full"
      },
      "full": {
        "type": "text"
      }
    }
  }
}

GET /demo/_mapping

POST /demo/_doc/1
{
  "full": "hi",
  "first": "HELLO",
  "second": "WORLD"
}

GET /demo/_doc/1

GET /demo/_search
{
  "query": {
    "match": {
      "full": "hello"
    }
  }
}

POST /demo/_update/1
{
  "doc": {
    "first": "ruby",
    "second": "java"
  }
}

GET /demo/_doc/1

GET /demo/_search
{
  "query": {
    "match": {
      "full": "hello"
    }
  }
}

fields

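With dynamic mapping, a string is automatically mapped as a text field with a keyword sub-field, as in the mapping below.
This relies on multi-fields: one property is indexed as several types.

PUT /demo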
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  }
}

Analyzer

Processing order: Character Filter => Tokenizer => Token Filter

Character Filter:
Pre-processes the raw text. Several character filters can be configured; they run in order.
For example, strip HTML tags first, then apply custom replacement rules:

GET /_analyze
{
  "char_filter": [
    "html_strip",
    {
      "type": "mapping",
      "mappings": [
        "_ => -",
        ":( => __unhappy__",
        ":) => __happy__"
      ]
    }
  ],
  "text": [
    "Hello_world :)"
  ]
}

Tokenizer:
Splits the text into terms. Exactly one tokenizer is allowed.
It also records each term's order, position and offset.

GET /_analyze
{
  "tokenizer": "whitespace",
  "text": "A big Apple."
}

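Token Filter:
Filters and otherwise transforms the tokens produced by the tokenizer (synonyms, for example); in other words, it normalizes the tokens. Several token filters can be configured.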

GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    "lowercase",
    "stop"
  ],
  "text": "A big Apple."
}
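
The three stages can be mimicked in plain Python to show how they chain together. This is a toy model, not the Lucene implementation: `analyze` is a hypothetical helper, the stop word set is a tiny stand-in, and the offsets refer to the text after character filtering.

```python
import re

def analyze(text, char_mappings, token_pattern, stopwords):
    """Run text through char filter -> tokenizer -> token filters (toy model)."""
    # 1. Character filter: replace mapped substrings in a single pass,
    #    preferring the longest key when several could match.
    if char_mappings:
        keys = sorted(char_mappings, key=len, reverse=True)
        pattern = re.compile("|".join(re.escape(k) for k in keys))
        text = pattern.sub(lambda m: char_mappings[m.group()], text)
    # 2. Tokenizer: emit tokens together with position and character offsets.
    tokens = [
        {"token": m.group(), "position": position,
         "start_offset": m.start(), "end_offset": m.end()}
        for position, m in enumerate(re.finditer(token_pattern, text))
    ]
    # 3. Token filters: lowercase every token, then drop stop words.
    for token in tokens:
        token["token"] = token["token"].lower()
    return [t for t in tokens if t["token"] not in stopwords]

# Whitespace tokenizer + lowercase + stop, as in the request above.
for token in analyze("A big Apple.", {}, r"\S+", {"a"}):
    print(token)

# A mapping character filter applied before tokenizing.
mapped = analyze("Hello_world :)", {"_": "-", ":)": "__happy__"}, r"\S+", set())
print([t["token"] for t in mapped])  # ['hello-world', '__happy__']
```

Note that the stop word "a" is removed but the remaining tokens keep their original positions (1 and 2), which is what keeps phrase queries accurate.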

Custom Analyzer

DELETE /demo

PUT /demo
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": [
            "my_char_filter"
          ],
          "tokenizer": "my_tokenizer",
          "filter": [
            "lowercase",
            "my_token_filter"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            ":) => __happy__",
            ":( => __unhappy__"
          ]
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "[,.!?@-]"
        }
      },
      "filter": {
        "my_token_filter": {
          "type": "stop",
          "stopwords": [
            "hi",
            "hello"
          ]
        }
      }
    }
  }
}

GET /demo/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Hello, @Big-Apple! :)"
}

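Custom synonym analyzer:

  • Synonyms are substituted at index time or at search time, depending on which analyzer uses the filter
  • expand defaults to true: the synonyms in a rule are interchangeable, and each one expands to all of them;
  • With expand set to false, every word in a rule is mapped to the first word;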
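
The difference between the two expand modes can be sketched in Python. The model below treats each comma-separated entry as a single term (multi-word synonyms included) and lowercases it; `build_synonym_map` is a hypothetical helper, not the filter's real implementation.

```python
def build_synonym_map(rules, expand=True):
    """Map each term to the terms it is indexed as, for comma-separated rules.

    Assumption: every comma-separated entry is treated as one lowercased term.
    """
    mapping = {}
    for rule in rules:
        terms = [t.strip().lower() for t in rule.split(",")]
        targets = terms if expand else terms[:1]
        for term in terms:
            for target in targets:
                if target not in mapping.setdefault(term, []):
                    mapping[term].append(target)
    return mapping

rules = ["IT, Internet", "IT, Integration Testing"]
print(build_synonym_map(rules, expand=True)["internet"])   # ['it', 'internet']
print(build_synonym_map(rules, expand=False)["internet"])  # ['it']
print(build_synonym_map(rules, expand=True)["it"])
# ['it', 'internet', 'integration testing']
```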
DELETE /demo

PUT /demo
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "my_synonym"
          ]
        }
      },
      "filter": {
        "my_synonym": {
          "type": "synonym",
          "expand": true,
          "synonyms": [
            "IT, Internet",
            "IT, Internet Technology",
            "IT, Integration Testing"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "job": {
        "type": "text",
        "analyzer": "my_analyzer",
        "search_analyzer": "standard"
      }
    }
  }
}

GET /demo/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Integration Testing"
}

POST /_bulk
{"index": {"_index": "demo", "_id": 1}}
{"job": "Testing"}
{"index": {"_index": "demo", "_id": 2}}
{"job": "IT"}
{"index": {"_index": "demo", "_id": 3}}
{"job": "Internet"}
{"index": {"_index": "demo", "_id": 4}}
{"job": "IT"}
{"index": {"_index": "demo", "_id": 5}}
{"job": "Internet Technology"}
{"index": {"_index": "demo", "_id": 6}}
{"job": "Integration Testing"}


GET /demo/_search
{
  "query": {
    "match": {
      "job": "Testing"
    }
  }
}

Analyzers are applied in two phases:

  • Index time: analyzer
  • Search time: search_analyzer

Index analyzer, in order of precedence:
1. The analyzer mapping parameter of the field
2. analysis.analyzer.default in the index settings
3. The standard analyzer

Search analyzer, in order of precedence:
1. The analyzer specified in the search request
2. The search_analyzer mapping parameter of the field
3. analysis.analyzer.default_search in the index settings
4. The analyzer mapping parameter of the field
5. The standard analyzer
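
Both precedence lists boil down to "take the first value that is set". The Python sketch below expresses that; the dictionaries are simplified stand-ins for the query, the field mapping and the index's analysis.analyzer settings, and all three functions are hypothetical helpers.

```python
def first_set(*candidates):
    """Return the first candidate that is not None, else the standard analyzer."""
    return next((c for c in candidates if c is not None), "standard")

def index_analyzer(field_mapping, index_analyzers):
    return first_set(field_mapping.get("analyzer"),
                     index_analyzers.get("default"))

def search_analyzer(query, field_mapping, index_analyzers):
    return first_set(query.get("analyzer"),
                     field_mapping.get("search_analyzer"),
                     index_analyzers.get("default_search"),
                     field_mapping.get("analyzer"))

job = {"analyzer": "my_analyzer", "search_analyzer": "standard"}
print(index_analyzer(job, {}))                                # my_analyzer
print(search_analyzer({}, job, {}))                           # standard
print(search_analyzer({"analyzer": "simple"}, job, {}))       # simple
print(search_analyzer({}, {"analyzer": "my_analyzer"}, {}))   # my_analyzer
```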

https://www.elastic.co/guide/en/elasticsearch/reference/current/specify-analyzer.html#specify-index-time-analyzer

Index Template

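  • Acts like a global default: its settings are applied to every index whose name matches the template's pattern.
  • Only takes effect when an index is created (deleting a template does not affect existing indices, and a newly added template does not affect them either).

Settings are applied in this order when an index is created:
1. The default settings / mappings;
2. The matching index templates in ascending order value, starting from 0, each overriding the previous ones;
3. The settings / mappings given by the user override all of the above.

Templates with a low order provide the base settings; templates with a high order provide the specific ones.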
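
The layering described above can be sketched as a sequence of dictionary merges. This Python model makes a few assumptions: settings are plain nested dicts, `fnmatchcase` approximates index pattern matching (it also accepts `?` and `[]`, which go beyond the `*` wildcard used here), and `deep_merge` and `resolve_settings` are hypothetical helpers.

```python
import copy
from fnmatch import fnmatchcase

def deep_merge(base, override):
    """Return a copy of base with override applied; nested dicts are merged."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged

def resolve_settings(index_name, defaults, templates, request_settings):
    """Defaults, then matching templates by ascending order, then the request."""
    result = defaults
    matching = [t for t in templates
                if any(fnmatchcase(index_name, p) for p in t["index_patterns"])]
    for template in sorted(matching, key=lambda t: t["order"]):
        result = deep_merge(result, template["settings"])
    return deep_merge(result, request_settings)

defaults = {"number_of_shards": 1, "number_of_replicas": 1}
templates = [
    {"order": 1, "index_patterns": ["demo*"], "settings": {"number_of_replicas": 2}},
    {"order": 0, "index_patterns": ["*"],
     "settings": {"number_of_replicas": 0, "refresh_interval": "30s"}},
]
print(resolve_settings("demo", defaults, templates, {"number_of_shards": 3}))
# {'number_of_shards': 3, 'number_of_replicas': 2, 'refresh_interval': '30s'}
```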

# List all templates
GET /_template
GET /_template/*

# Show specific templates
GET /_template/my_template
GET /_template/my*

DELETE /_template/my_template

POST /_template/my_template
{
  "order": 1, 
  "index_patterns": [
    "*"
  ],
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "my_synonym"
          ]
        }
      },
      "filter": {
        "my_synonym": {
          "type": "synonym",
          "synonyms": [
            "IT, Internet",
            "IT, Internet Technology",
            "IT, Integration Testing"
          ]
        }
      }
    }
  },
  "mappings": {}
}

DELETE /demo

PUT /demo
{
  "mappings": {
    "properties": {
      "job": {
        "type": "text",
        "analyzer": "my_analyzer",
        "search_analyzer": "simple"
      }
    }
  }
}

GET /demo/_mapping

GET /demo/_analyze
{
  "field": "job",
  "text": "Internet Technology"
}

Dynamic Template

Defined on a specific index, it offers more convenient ways to match dynamically added fields and assign them mappings.

https://www.elastic.co/guide/en/elasticsearch/reference/current/dynamic-templates.html

Aggregations (aggs)

GET /kibana_sample_data_flights/_search
{
  "size": 0,
  "aggs": {
    "my_aggs_dest": {
      "terms": {
        "field": "DestCountry",
        "size": 3
      },
      "aggs": {
        "my_price": {
          "stats": {
            "field": "AvgTicketPrice"
          }
        }
      }
    }
  }
}
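
To show what this request computes, here is a small Python sketch of a terms aggregation with a nested stats aggregation. The rows are invented sample data, not the Kibana flights data set, and `terms_with_stats` is a hypothetical helper. Buckets are ordered by document count descending, with ties broken by key.

```python
from collections import defaultdict

def terms_with_stats(rows, terms_field, stats_field, size):
    """Group rows by terms_field, keep the `size` largest buckets, add stats."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[terms_field]].append(row[stats_field])
    top = sorted(buckets.items(), key=lambda kv: (-len(kv[1]), kv[0]))[:size]
    return [
        {"key": key, "doc_count": len(values),
         "stats": {"count": len(values), "min": min(values), "max": max(values),
                   "avg": sum(values) / len(values), "sum": sum(values)}}
        for key, values in top
    ]

# Invented sample rows for illustration only.
rows = [
    {"DestCountry": "IT", "AvgTicketPrice": 100.0},
    {"DestCountry": "IT", "AvgTicketPrice": 300.0},
    {"DestCountry": "US", "AvgTicketPrice": 200.0},
    {"DestCountry": "CN", "AvgTicketPrice": 50.0},
    {"DestCountry": "CN", "AvgTicketPrice": 150.0},
    {"DestCountry": "JP", "AvgTicketPrice": 80.0},
]
for bucket in terms_with_stats(rows, "DestCountry", "AvgTicketPrice", size=3):
    print(bucket["key"], bucket["doc_count"], bucket["stats"]["avg"])
```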
