Elasticsearch Index Query Optimization and a Deep Dive into Mapping and Analysis: Search Systems in Production

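The author of this technical column (Qin Kaixin) focuses on the core technologies behind big data and container clouds, has 5 years of experience building industrial-grade IoT big-data cloud platforms, and can provide full-stack consulting on big data and cloud-native platforms. Please keep following this blog series. QQ email: [email protected]. Feel free to get in touch for any technical discussion.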

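1 Multi-index and multi-type search patterns

    /_search: search all data in all types of all indices
    /index1/_search: search all types in one specified index
    /index1,index2/_search: search two indices at once
    /*1,*2/_search: match multiple indices with wildcards
    /index1/type1/_search: search one specified type in one index
    /index1/type1,type2/_search: search several types in one index
    /index1,index2/type1,type2/_search: search several types across several indices
    /_all/type1,type2/_search: _all stands for every index, so this searches the specified types in all indices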
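The URL patterns above all follow one rule: an optional comma-separated index segment, an optional comma-separated type segment, then _search. Here is a minimal Python sketch of that rule, assuming the ES 5.x URL layout used in this article. The helper name search_path is made up for illustration and is not part of any Elasticsearch client.

```python
from typing import Sequence


def search_path(indices: Sequence[str] = (), types: Sequence[str] = ()) -> str:
    """Build a _search path in the /index/type/_search style (hypothetical helper)."""
    parts = []
    if indices:
        parts.append(",".join(indices))
    elif types:
        # A type segment needs an index segment in front of it; _all means every index.
        parts.append("_all")
    if types:
        parts.append(",".join(types))
    parts.append("_search")
    return "/" + "/".join(parts)


print(search_path())                                          # /_search
print(search_path(["index1", "index2"], ["type1", "type2"]))  # /index1,index2/type1,type2/_search
print(search_path(types=["type1", "type2"]))                  # /_all/type1,type2/_search
```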

2 Paginated search (and avoiding deep paging)

size is the page size; from is the offset of the first document to return.

        GET /_search?size=10
        GET /_search?size=10&from=0
        GET /_search?size=10&from=20



        GET /test_index/test_type/_search
        
        "hits": {
            "total": 9,
            "max_score": 1,

Suppose we split these 9 documents into 3 pages of 3 documents each and see how paginated search behaves.

        GET /test_index/test_type/_search?from=0&size=3
        
        {
          "took": 2,
          "timed_out": false,
          "_shards": {
            "total": 5,
            "successful": 5,
            "failed": 0
          },
          "hits": {
            "total": 9,
            "max_score": 1,
            "hits": [
              {
                "_index": "test_index",
                "_type": "test_type",
                "_id": "8",
                "_score": 1,
                "_source": {
                  "test_field": "test client 2"
                }
              },
              {
                "_index": "test_index",
                "_type": "test_type",
                "_id": "6",
                "_score": 1,
                "_source": {
                  "test_field": "tes test"
                }
              },
              {
                "_index": "test_index",
                "_type": "test_type",
                "_id": "4",
                "_score": 1,
                "_source": {
                  "test_field": "test4"
                }
              }
            ]
          }
        }
        
        Page 1: id=8,6,4
        
        GET /test_index/test_type/_search?from=3&size=3
        
        Page 2: id=2, (auto-generated id), 7
        
        GET /test_index/test_type/_search?from=6&size=3
        
        Page 3: id=1,11,3
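Why is deep paging worth avoiding? With from/size, every shard has to produce its own top from + size hits, and the coordinating node has to merge and sort all of them before throwing away the first from. The rough Python sketch below does that arithmetic. Both function names are made up for illustration, and the shard count of 5 is taken from the _shards.total value in the response above.

```python
def page_params(page: int, size: int) -> dict:
    """Translate a 1-based page number into from/size query parameters."""
    if page < 1 or size < 1:
        raise ValueError("page and size must be >= 1")
    return {"from": (page - 1) * size, "size": size}


def coordinating_node_candidates(from_: int, size: int, shards: int) -> int:
    """Number of hits the coordinating node has to collect and sort for one request.

    Every shard returns its own top (from + size) hits, because any of them
    could end up on the requested page.
    """
    return shards * (from_ + size)


print(page_params(3, 3))                            # {'from': 6, 'size': 3}
print(coordinating_node_candidates(6, 3, 5))        # 45
print(coordinating_node_candidates(10000, 10, 5))   # 50050
```

Fetching 10 documents at offset 10000 makes the coordinating node sort about 50,000 candidates, and that cost is what makes deep pages expensive.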

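3 How mapping works

3.1 A first look at dynamic mapping

A mapping is the data structure and associated settings, created automatically or by hand, for a type in an index.
With dynamic mapping, ES automatically creates the index, the type, and the type's mapping. The mapping records the data type of each field and settings such as how the field is analyzed. As covered later, you can also create the index, the type, and its mapping by hand before indexing any data.
Insert a few documents and let ES create the index for us: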

        PUT /website/article/1
        {
          "post_date": "2017-01-01",
          "title": "my first article",
          "content": "this is my first article in this website",
          "author_id": 11400
        }
        
        PUT /website/article/2
        {
          "post_date": "2017-01-02",
          "title": "my second article",
          "content": "this is my second article in this website",
          "author_id": 11400
        }
        
        PUT /website/article/3
        {
          "post_date": "2017-01-03",
          "title": "my third article",
          "content": "this is my third article in this website",
          "author_id": 11400
        }

Try a few searches (the results show that matching is not always exact)

        GET /website/article/_search?q=2017                     3 results
        GET /website/article/_search?q=2017-01-01               3 results
        GET /website/article/_search?q=post_date:2017-01-01     1 result
        GET /website/article/_search?q=post_date:2017           1 result

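Look at the mapping to see why the results differ. When ES built the mapping automatically, it assigned different data types to different fields, and different data types are analyzed and searched differently. That is why searching the _all field and searching the post_date field behave so differently.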

        GET /website/_mapping/article
        
        {
          "website": {
            "mappings": {
              "article": {
                "properties": {
                  "author_id": {
                    "type": "long"
                  },
                  "content": {
                    "type": "text",
                    "fields": {
                      "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                      }
                    }
                  },
                  "post_date": {
                    "type": "date"
                  },
                  "title": {
                    "type": "text",
                    "fields": {
                      "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                      }
                    }
                  }
                }
              }
            }
          }
        }

3.2 Dynamic mapping demystified

Tokenization and a first inverted index

doc1: I really liked my small dogs, and I think my mom also liked them.
doc2: He never liked any dogs, so I hope that my mom will not expect me to liked him.


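Run a search

    mother like little dog
    This returns nothing.
    Is that the result we want? Definitely not. To us, mother and mom are synonyms.
    like and liked mean the same thing; one is present tense and the other is past tense. little and small
    are synonyms too. dog and dogs differ only in number: one is singular, the other plural.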

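Normalization: when the inverted index is built, each token produced by tokenization is processed so that later searches are more likely to find related documents.

        Tense conversion, singular/plural conversion, synonym conversion, case conversion
        mom --> mother
        liked --> like
        small --> little
        dogs --> dog

Rebuild the inverted index with normalization applied, search for mother like little dog again, and the documents are found.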
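The effect of normalization is easy to reproduce with a toy inverted index. The Python sketch below is only an illustration. The NORMALIZE table is hand-written for the four word pairs in this example, whereas a real analyzer uses stemmers and synonym filters, and the function names are invented here.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set

# Hand-written normalization table covering only the words in this example.
NORMALIZE = {"mom": "mother", "liked": "like", "small": "little", "dogs": "dog"}


def tokenize(text: str) -> List[str]:
    """Split on anything that is not a letter and lowercase the pieces."""
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]


def to_term(token: str, use_normalization: bool) -> str:
    return NORMALIZE.get(token, token) if use_normalization else token


def build_index(docs: Dict[str, str], use_normalization: bool) -> Dict[str, Set[str]]:
    """Map each term to the set of document ids that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[to_term(token, use_normalization)].add(doc_id)
    return index


def search(index: Dict[str, Set[str]], query: str, use_normalization: bool) -> Set[str]:
    """Return ids of documents that contain at least one query term."""
    hits: Set[str] = set()
    for token in tokenize(query):
        # The query goes through the same processing as the indexed text.
        hits |= index.get(to_term(token, use_normalization), set())
    return hits


docs = {
    "doc1": "I really liked my small dogs, and I think my mom also liked them.",
    "doc2": "He never liked any dogs, so I hope that my mom will not expect me to liked him.",
}

raw_index = build_index(docs, use_normalization=False)
normalized_index = build_index(docs, use_normalization=True)
print(search(raw_index, "mother like little dog", False))                   # set()
print(sorted(search(normalized_index, "mother like little dog", True)))     # ['doc1', 'doc2']
```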

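3.3 What is an analyzer

An analyzer splits text into terms and normalizes them, which improves recall. Given a sentence, it breaks the sentence into individual words and normalizes each one (tense conversion, singular/plural conversion). Improving recall means increasing the number of relevant results a search is able to find.

    character filter: preprocesses the text before tokenization. The most common examples are stripping
    HTML tags (e.g. <span>hello</span> --> hello) and & --> and (I&you --> I and you)
    
    tokenizer: splits the text into tokens, hello you and me --> hello, you, and, me
    
    token filter: lowercase, stop words, synonyms: dogs --> dog, liked --> like, Tom --> tom, a/the/an --> removed, mom --> mother, small --> little

The analyzer matters a great deal: it processes the text in several stages, and only the final output is used to build the inverted index.
Built-in analyzers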

    Set the shape to semi-transparent by calling set_trans(5)
    
    standard analyzer (the default): set, the, shape, to, semi, transparent, by, calling, set_trans, 5
    
    simple analyzer: set, the, shape, to, semi, transparent, by, calling, set, trans
    
    whitespace analyzer: Set, the, shape, to, semi-transparent, by, calling, set_trans(5)
    
    language analyzer (language-specific, e.g. english for English): set, shape, semi, transpar, call, set_tran, 5
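The three stages (character filter, tokenizer, token filter) can be sketched in a few lines of Python. This is a simplified imitation and not how Elasticsearch implements them. The stop-word list is a tiny made-up one, and the tokenizer mimics the behavior described above for the simple analyzer (split on non-letters).

```python
import re
from typing import List

STOP_WORDS = {"a", "an", "the"}  # tiny illustrative stop-word list


def char_filter(text: str) -> str:
    """Strip HTML tags and spell out '&' before tokenization."""
    text = re.sub(r"<[^>]+>", "", text)
    return text.replace("&", " and ")


def letter_tokenizer(text: str) -> List[str]:
    """Emit one token for every maximal run of letters."""
    return re.findall(r"[A-Za-z]+", text)


def token_filters(tokens: List[str]) -> List[str]:
    """Lowercase every token, then drop stop words."""
    lowered = [t.lower() for t in tokens]
    return [t for t in lowered if t not in STOP_WORDS]


def analyze(text: str) -> List[str]:
    """Run the three stages in order: char filter, tokenizer, token filters."""
    return token_filters(letter_tokenizer(char_filter(text)))


print(analyze("<span>Tom</span> & the dogs"))   # ['tom', 'and', 'dogs']

sentence = "Set the shape to semi-transparent by calling set_trans(5)"
# Splitting on whitespace only reproduces the whitespace analyzer output listed above.
print(sentence.split())
# Splitting on non-letters and lowercasing reproduces the simple analyzer output.
print([t.lower() for t in letter_tokenizer(sentence)])
```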

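3.4 The earlier puzzle explained

How the query string is analyzed

    The query string must be analyzed with the same analyzer that was used when the index was built
    The query string is treated differently for exact-value fields and full-text fields
    
     Key point: depending on its type, a field is either full text or exact value
    
    post_date, date: exact value
    _all: full text, tokenized and normalized
    
    GET /_search?q=2017

    This searches the _all field. All fields of a document are concatenated into one long string, which is then tokenized. For doc2 the string is
    
    2017-01-02 my second article this is my second article in this website 11400
    
                doc1    doc2    doc3
    2017        *       *       *
    01          *       *       *
    02                  *
    03                          *
    
    Searching _all for 2017 therefore matches all 3 documents
    
    GET /_search?q=2017-01-01
    
    Searching _all for 2017-01-01: the query string is tokenized by the same analyzer that built the inverted index, producing
    
    2017
    01
    01
    
    The term 2017 occurs in every document, so all 3 documents match
    
    GET /_search?q=post_date:2017-01-01

    A date field is indexed as an exact value
    
                    doc1    doc2    doc3
    2017-01-01      *
    2017-01-02              *
    2017-01-03                      *
    
    post_date:2017-01-01 looks up the term 2017-01-01, so only doc1 is returned
    
    GET /_search?q=post_date:2017 is not covered here, because its behavior comes from an optimization introduced after ES 5.2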
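The different result counts can be reproduced without a cluster. The Python sketch below imitates the two behaviors under simplifying assumptions: _all is modeled as the field values joined by spaces, analysis is reduced to lowercasing and splitting on non-alphanumeric characters, and the query terms are combined with OR. The function names are invented for this illustration.

```python
import re
from typing import Dict, List

docs: Dict[str, dict] = {
    "doc1": {"post_date": "2017-01-01", "title": "my first article",
             "content": "this is my first article in this website", "author_id": 11400},
    "doc2": {"post_date": "2017-01-02", "title": "my second article",
             "content": "this is my second article in this website", "author_id": 11400},
    "doc3": {"post_date": "2017-01-03", "title": "my third article",
             "content": "this is my third article in this website", "author_id": 11400},
}


def analyze(text: str) -> List[str]:
    """Crude analyzer: lowercase, then split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def search_all(query: str) -> List[str]:
    """Full-text search on a simulated _all field; any shared term is a match."""
    query_terms = set(analyze(query))
    hits = []
    for doc_id, doc in docs.items():
        all_text = " ".join(str(value) for value in doc.values())
        if query_terms & set(analyze(all_text)):
            hits.append(doc_id)
    return hits


def search_exact(field: str, value: str) -> List[str]:
    """Exact-value search: the whole field value has to equal the query value."""
    return [doc_id for doc_id, doc in docs.items() if str(doc[field]) == value]


print(search_all("2017"))                        # ['doc1', 'doc2', 'doc3']
print(search_all("2017-01-01"))                  # ['doc1', 'doc2', 'doc3']
print(search_exact("post_date", "2017-01-01"))   # ['doc1']
```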

Testing an analyzer

    GET /_analyze
    {
      "analyzer": "standard",
      "text": "Text to analyze"
    }

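3.5 Mapping principles

When you insert data straight into ES, it automatically creates the index, the type, and the corresponding mapping.
The mapping automatically defines the data type of every field.
Depending on the data type (text versus date, for example), a field is either exact value or full text.
An exact value goes into the inverted index whole, as a single term. Full text goes through several processing steps first, tokenization and normalization (tense, synonym, and case conversion), before it is added to the inverted index.
Whether a field is exact value or full text also determines how a search against it behaves, and that behavior mirrors how the inverted index was built. A search on an exact-value field matches against the whole value; a query string against a full-text field is tokenized and normalized before the inverted index is searched.
You can let ES's dynamic mapping build the mapping for you, including choosing data types, or you can create the mapping for an index and type by hand in advance and configure each field yourself: data type, indexing behavior, analyzer, and so on.
A mapping is the metadata of a type in an index. Every type has its own mapping, which determines the data types, how the inverted index is built, and how searches behave.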

3.6 Mapping data types

Core data types

    string (split into text and keyword since ES 5.0)
    byte, short, integer, long
    float, double
    boolean
    date

dynamic mapping

    true or false   -->     boolean
    123             -->     long
    123.45          -->     float (double before ES 5.0)
    2017-01-01      -->     date
    "hello world"   -->     text with a keyword sub-field (string before ES 5.0)
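The dynamic mapping rules amount to a type switch on the JSON value. Here is a rough Python approximation. It assumes that date detection only needs to recognize the yyyy-MM-dd form used in this article, which is a big simplification of the real date detection. It also returns float for floating-point numbers, which is the ES 5.x default (older versions chose double). The function name dynamic_type is invented for illustration.

```python
import re
from typing import Any

# Simplification: only the yyyy-MM-dd form is treated as a date here.
DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def dynamic_type(value: Any) -> str:
    """Guess the field type that dynamic mapping would pick for a JSON value."""
    # bool has to be tested before int, because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "date" if DATE_PATTERN.match(value) else "text"
    if isinstance(value, dict):
        return "object"
    raise TypeError(f"unsupported value: {value!r}")


article = {
    "post_date": "2017-01-01",
    "title": "my first article",
    "content": "this is my first article in this website",
    "author_id": 11400,
}
print({field: dynamic_type(value) for field, value in article.items()})
# {'post_date': 'date', 'title': 'text', 'content': 'text', 'author_id': 'long'}
```

The result agrees with the mapping that ES generated for the website index in section 3.1.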

Viewing a mapping

    GET /index/_mapping/type

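3.7 Creating an index with a manual mapping

How a field is indexed

    analyzed: full text, analyzed before indexing (type text since ES 5.0)
    not_analyzed: exact value, indexed as a single term (type keyword since ES 5.0)
    no: not indexed, so the field cannot be searched ("index": false since ES 5.0)

Create the index with an explicit mapping (the index must not exist yet)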

    PUT /website
    {
      "mappings": {
        "article": {
          "properties": {
            "author_id": {
              "type": "long"
            },
            "title": {
              "type": "text",
              "analyzer": "english"
            },
            "content": {
              "type": "text"
            },
            "post_date": {
              "type": "date"
            },
            "publisher_id": {
              "type": "keyword"
            }
          }
        }
      }
    }

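Trying to change the mapping of an existing field like this fails: the index already exists, and the type of an existing field cannot be changed

        PUT /website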
        {
          "mappings": {
            "article": {
              "properties": {
                "author_id": {
                  "type": "text"
                }
              }
            }
          }
        }
        
        {
          "error": {
            "root_cause": [
              {
                "type": "index_already_exists_exception",
                "reason": "index [website/co1dgJ-uTYGBEEOOL8GsQQ] already exists",
                "index_uuid": "co1dgJ-uTYGBEEOOL8GsQQ",
                "index": "website"
              }
            ],
            "type": "index_already_exists_exception",
            "reason": "index [website/co1dgJ-uTYGBEEOOL8GsQQ] already exists",
            "index_uuid": "co1dgJ-uTYGBEEOOL8GsQQ",
            "index": "website"
          },
          "status": 400
        }

The correct approach: add a new field through the _mapping endpoint

        PUT /website/_mapping/article
        {
          "properties" : {
            "new_field" : {
              "type" :    "keyword"
            }
          }
        }
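The rule behind the two requests above is that new fields can be added to a mapping, but the type of an existing field cannot be changed. The toy Python model below captures just that rule. It is not Elasticsearch's merge logic, which checks many more settings, and both the function name put_mapping and the error text are invented for illustration.

```python
from typing import Dict


def put_mapping(existing: Dict[str, dict], update: Dict[str, dict]) -> Dict[str, dict]:
    """Merge new field definitions into a mapping, refusing type changes (toy model)."""
    merged = dict(existing)
    for field, definition in update.items():
        current = existing.get(field)
        if current is not None and current.get("type") != definition.get("type"):
            raise ValueError(
                f"cannot change type of [{field}] from "
                f"[{current.get('type')}] to [{definition.get('type')}]"
            )
        merged[field] = definition
    return merged


properties = {"author_id": {"type": "long"}, "title": {"type": "text"}}

# Adding a field that does not exist yet is allowed.
properties = put_mapping(properties, {"new_field": {"type": "keyword"}})
print(sorted(properties))   # ['author_id', 'new_field', 'title']

# Changing the type of an existing field is rejected.
try:
    put_mapping(properties, {"author_id": {"type": "text"}})
except ValueError as err:
    print(err)
```

To really change a field's type, the usual route is to create a new index with the desired mapping and reindex the data into it.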

Testing the mapping (the second request fails because new_field is not an analyzed field)

        GET /website/_analyze
        {
          "field": "content",
          "text": "my-dogs" 
        }
        
        GET website/_analyze
        {
          "field": "new_field",
          "text": "my dogs"
        }
        
        {
          "error": {
            "root_cause": [
              {
                "type": "remote_transport_exception",
                "reason": "[4onsTYV][127.0.0.1:9300][indices:admin/analyze[s]]"
              }
            ],
            "type": "illegal_argument_exception",
            "reason": "Can't process field [new_field], Analysis requests are only supported on tokenized fields"
          },
          "status": 400
        }

Multi-value field: indexed the same way as a single string value. All values in the array must have the same data type.

        { "tags": [ "tag1", "tag2" ]}

empty field

        null, [], [null]

object field

        PUT /company/employee/1
        {
          "address": {
            "country": "china",
            "province": "guangdong",
            "city": "guangzhou"
          },
          "name": "jack",
          "age": 27,
          "join_date": "2017-01-01"
        }
        
        address: object type
        
        GET /company/_mapping/employee
        
        {
          "company": {
            "mappings": {
              "employee": {
                "properties": {
                  "address": {
                    "properties": {
                      "city": {
                        "type": "text",
                        "fields": {
                          "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                          }
                        }
                      },
                      "country": {
                        "type": "text",
                        "fields": {
                          "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                          }
                        }
                      },
                      "province": {
                        "type": "text",
                        "fields": {
                          "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                          }
                        }
                      }
                    }
                  },
                  "age": {
                    "type": "long"
                  },
                  "join_date": {
                    "type": "date"
                  },
                  "name": {
                    "type": "text",
                    "fields": {
                      "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                      }
                    }
                  }
                }
              }
            }
          }
        }
        

Underlying ES data structure 1: an object field is flattened into dotted field names

        
        {
          "address": {
            "country": "china",
            "province": "guangdong",
            "city": "guangzhou"
          },
          "name": "jack",
          "age": 27,
          "join_date": "2017-01-01"
        }
        
        {
            "name":            [jack],
            "age":          [27],
            "join_date":      [2017-01-01],
            "address.country":         [china],
            "address.province":   [guangdong],
            "address.city":  [guangzhou]
        }

Underlying ES data structure 2: an array of objects is flattened into one multi-value field per property

        {
            "authors": [
                { "age": 26, "name": "Jack White"},
                { "age": 55, "name": "Tom Jones"},
                { "age": 39, "name": "Kitty Smith"}
            ]
        }
        
        {
            "authors.age":    [26, 55, 39],
            "authors.name":   [jack, white, tom, jones, kitty, smith]
        }
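The flattening shown in the two examples above can be imitated with a short recursive function. The Python sketch below assumes every string is a full-text field, so it lowercases strings and splits them into terms. A date or keyword value would in reality stay whole. The function name flatten is made up for illustration.

```python
import re
from typing import Any, Dict, List


def flatten(doc: Dict[str, Any], prefix: str = "") -> Dict[str, List[Any]]:
    """Flatten nested objects and arrays of objects into dotted multi-value fields."""
    flat: Dict[str, List[Any]] = {}
    for key, value in doc.items():
        path = prefix + key
        items = value if isinstance(value, list) else [value]
        for item in items:
            if isinstance(item, dict):
                # Recurse into the inner object and merge its fields under "path.".
                for sub_path, sub_values in flatten(item, path + ".").items():
                    flat.setdefault(sub_path, []).extend(sub_values)
            elif isinstance(item, str):
                # Treat strings as full text: lowercase and split into terms.
                flat.setdefault(path, []).extend(re.findall(r"\w+", item.lower()))
            else:
                flat.setdefault(path, []).append(item)
    return flat


book = {
    "authors": [
        {"age": 26, "name": "Jack White"},
        {"age": 55, "name": "Tom Jones"},
        {"age": 39, "name": "Kitty Smith"},
    ]
}
print(flatten(book))
# {'authors.age': [26, 55, 39], 'authors.name': ['jack', 'white', 'tom', 'jones', 'kitty', 'smith']}
```

Note that the flattened form no longer records which age belongs to which name.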

4 Query DSL

Using the Query DSL

        GET /_search
        {
            "query": {
                "match_all": {}
            }
        }

Basic Query DSL syntax

        {
            QUERY_NAME: {
                ARGUMENT: VALUE,
                ARGUMENT: VALUE,...
            }
        }
        
        {
            QUERY_NAME: {
                FIELD_NAME: {
                    ARGUMENT: VALUE,
                    ARGUMENT: VALUE,...
                }
            }
        }
        
        GET /test_index/test_type/_search 
        {
          "query": {
            "match": {
              "test_field": "test"
            }
          }
        }

Combining multiple search conditions. The requirement: title must contain elasticsearch, content may or may not contain elasticsearch, and author_id must not be 111.

        GET /website/article/_search
        {
          "query": {
            "bool": {
              "must": [
                {
                  "match": {
                    "title": "elasticsearch"
                  }
                }
              ],
              "should": [
                {
                  "match": {
                    "content": "elasticsearch"
                  }
                }
              ],
              "must_not": [
                {
                  "match": {
                    "author_id": 111
                  }
                }
              ]
            }
          }
        }
        
        GET /test_index/_search
        {
            "query": {
                    "bool": {
                        "must": { "match":   { "name": "tom" }},
                        "should": [
                            { "match":       { "hired": true }},
                            { "bool": {
                                "must":      { "match": { "personality": "good" }},
                                "must_not":  { "match": { "rude": true }}
                            }}
                        ],
                        "minimum_should_match": 1
                    }
            }
        }
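To make the semantics of must, should, must_not, and minimum_should_match concrete, here is a small Python evaluator that checks a bool query against in-memory documents. It is a simplified model and not Elasticsearch's implementation. It has no scoring and no filter clause, match is reduced to a case-insensitive term comparison, and minimum_should_match is only supported as an integer. The sample documents are invented.

```python
from typing import Any, Dict, List


def as_list(clauses: Any) -> List[dict]:
    """must/should/must_not accept either one clause or a list of clauses."""
    if clauses is None:
        return []
    return clauses if isinstance(clauses, list) else [clauses]


def matches(query: Dict[str, Any], doc: Dict[str, Any]) -> bool:
    """Evaluate a match or bool query against one document (toy model)."""
    if "match" in query:
        field, wanted = next(iter(query["match"].items()))
        value = doc.get(field)
        if isinstance(value, str) and isinstance(wanted, str):
            # Crude stand-in for analysis: lowercase and split on whitespace.
            return any(term in value.lower().split() for term in wanted.lower().split())
        return value == wanted
    if "bool" in query:
        body = query["bool"]
        must = as_list(body.get("must"))
        should = as_list(body.get("should"))
        must_not = as_list(body.get("must_not"))
        # Without a must clause, at least one should clause has to match by default.
        default_minimum = 1 if should and not must else 0
        minimum = body.get("minimum_should_match", default_minimum)
        if not all(matches(q, doc) for q in must):
            return False
        if any(matches(q, doc) for q in must_not):
            return False
        return sum(matches(q, doc) for q in should) >= minimum
    raise ValueError("unsupported query type")


query = {
    "bool": {
        "must": {"match": {"name": "tom"}},
        "should": [
            {"match": {"hired": True}},
            {"bool": {
                "must": {"match": {"personality": "good"}},
                "must_not": {"match": {"rude": True}},
            }},
        ],
        "minimum_should_match": 1,
    }
}

people = [
    {"name": "Tom Jones", "hired": True},
    {"name": "Tom Smith", "hired": False, "personality": "good", "rude": True},
    {"name": "Tom Lee", "hired": False, "personality": "good", "rude": False},
    {"name": "Jack White", "hired": True},
]
print([p["name"] for p in people if matches(query, p)])   # ['Tom Jones', 'Tom Lee']
```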

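5 Summary

A production deployment still involves a great deal more work. This article starts from the basics and pulls the related questions together.

Qin Kaixin

Copyright belongs to the author. For commercial reproduction, contact the author for authorization; for non-commercial reproduction, credit the source.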
