29. Advanced Elasticsearch Techniques

1. Exploring your data in depth with term vectors
(1) Introduction to term vectors
The term vectors API returns statistics for each term within a given field of a document.

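(2) Index-time term vector experiment
Term vectors involve a lot of term-level and field-level statistics, and there are two ways to collect them:
index-time: you configure term_vector in the mapping, and the term and field statistics are generated as documents are indexed.
query-time: no term vector information has been stored in advance; when you request term vectors, the statistics are computed on the fly and returned to you.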

PUT /my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "text": {
            "type": "text",
            "term_vector": "with_positions_offsets_payloads",
            "store" : true,
            "analyzer" : "fulltext_analyzer"
         },
         "fullname": {
            "type": "text",
            "analyzer" : "fulltext_analyzer"
        }
      }
    }
  },
  "settings" : {
    "index" : {
      "number_of_shards" : 1,
      "number_of_replicas" : 0
    },
    "analysis": {
      "analyzer": {
        "fulltext_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "type_as_payload"
          ]
        }
      }
    }
  }
}

PUT /my_index/my_type/1
{
  "fullname" : "Leo Li",
  "text" : "hello test test test "
}

PUT /my_index/my_type/2
{
  "fullname" : "Leo Li",
  "text" : "other hello test ..."
}

GET /my_index/my_type/1/_termvectors
{
  "fields" : ["text"],
  "offsets" : true,
  "payloads" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true
}

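term statistics: enabled with "term_statistics": true.
Returns:
total term frequency (ttf): the total number of times the term occurs across all documents;
document frequency (doc_freq): the number of documents that contain the term.

field statistics: enabled with "field_statistics": true.
Returns:
document count (doc_count): the number of documents that contain this field;
sum of document frequency (sum_doc_freq): the sum of doc_freq over all terms in this field;
sum of total term frequency (sum_ttf): the sum of ttf over all terms in this field. These statistics are computed over the index (strictly speaking, over the shard that holds the document).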

{
  "_index": "my_index",
  "_type": "my_type",
  "_id": "1",
  "_version": 1,
  "found": true,
  "took": 1,
  "term_vectors": {
    "text": {
      "field_statistics": {
        "sum_doc_freq": 6,
        "doc_count": 2,
        "sum_ttf": 8
      },
      "terms": {
        "hello": {
          "doc_freq": 2,
          "ttf": 2,
          "term_freq": 1,
          "tokens": [
            {
              "position": 0,
              "start_offset": 0,
              "end_offset": 5,
              "payload": "d29yZA=="
            }
          ]
        },
        "test": {
          "doc_freq": 2,
          "ttf": 4,
          "term_freq": 3,
          "tokens": [
            {
              "position": 1,
              "start_offset": 6,
              "end_offset": 10,
              "payload": "d29yZA=="
            },
            {
              "position": 2,
              "start_offset": 11,
              "end_offset": 15,
              "payload": "d29yZA=="
            },
            {
              "position": 3,
              "start_offset": 16,
              "end_offset": 20,
              "payload": "d29yZA=="
            }
          ]
        }
      }
    }
  }
}
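
The numbers in the response above can be checked by hand. The following is a minimal Python sketch, not how Elasticsearch computes them internally. It assumes that fulltext_analyzer behaves like a whitespace split plus lowercasing, and the helper names (analyze, doc_freq, ttf, field_stats) are made up for illustration.

```python
from collections import Counter

# "text" values of the two documents indexed above.
docs = {
    "1": "hello test test test ",
    "2": "other hello test ...",
}


def analyze(text):
    # Assumption: whitespace tokenizer + lowercase filter, as in fulltext_analyzer.
    return [token.lower() for token in text.split()]


# Per-document term frequencies (term_freq in the response).
term_freqs = {doc_id: Counter(analyze(text)) for doc_id, text in docs.items()}


def doc_freq(term):
    # Number of documents that contain the term.
    return sum(1 for tf in term_freqs.values() if term in tf)


def ttf(term):
    # Total occurrences of the term across all documents.
    return sum(tf[term] for tf in term_freqs.values())


def field_stats():
    all_terms = set().union(*term_freqs.values())
    return {
        "doc_count": sum(1 for tf in term_freqs.values() if tf),
        "sum_doc_freq": sum(doc_freq(t) for t in all_terms),
        "sum_ttf": sum(ttf(t) for t in all_terms),
    }


print(term_freqs["1"]["test"], ttf("test"), doc_freq("test"))  # 3 4 2
print(field_stats())
```

Note that document 2 contributes the tokens other and ..., which is why sum_doc_freq is 6 even though the response for document 1 lists only two terms.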

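(3) Query-time term vector experiment
The fullname field has no term_vector setting in the mapping, so its term vectors are computed on the fly when requested.

GET /my_index/my_type/1/_termvectors
{
  "fields": ["fullname"],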
  "offsets": true,
  "positions": true,
  "term_statistics": true,
  "field_statistics": true
}

If your situation allows it, query-time term vectors are usually all you need.

(4) Term vectors for a manually supplied document

GET /my_index/my_type/_termvectors
{
  "doc" : {
    "fullname" : "Leo Li",
    "text" : "hello test test test"
  },
  "fields" : ["text"],
  "offsets" : true,
  "payloads" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true
}

(5) Specifying the analyzer manually when generating term vectors

GET /my_index/my_type/_termvectors
{
  "doc" : {
    "fullname" : "Leo Li",
    "text" : "hello test test test"
  },
  "fields" : ["text"],
  "offsets" : true,
  "payloads" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true,
  "per_field_analyzer" : {
    "text": "standard"
  }
}

(6) terms filter
This filters the term vector results by term statistics, so that only the terms you care about are returned.

GET /my_index/my_type/_termvectors
{
  "doc" : {
    "fullname" : "Leo Li",
    "text" : "hello test test test"
  },
  "fields" : ["text"],
  "offsets" : true,
  "payloads" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true,
  "filter" : {
      "max_num_terms" : 3,
      "min_term_freq" : 1,
      "min_doc_freq" : 1
    }
}

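2. Highlighting
(1) The three highlighters

plain highlighter: the Lucene highlighter, used by default.
postings highlighter: enabled by setting index_options=offsets in the mapping.
It is faster than the plain highlighter because it does not need to re-analyze the text being highlighted.
It needs less disk space than the term vectors required by the fast vector highlighter.
It splits the text into sentences and highlights those sentences, which tends to give better-looking results.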

PUT /blog_website
{
  "mappings": {
    "blogs": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "ik_max_word"
        },
        "content": {
          "type": "text",
          "analyzer": "ik_max_word",
          "index_options": "offsets"
        }
      }
    }
  }
}

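fast vector highlighter: enabled by setting term_vector=with_positions_offsets in the mapping.
It performs better on large fields (larger than 1 MB).
The mapping below is an alternative to the one above; delete the existing index first if you want to recreate it.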

PUT /blog_website
{
  "mappings": {
    "blogs": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "ik_max_word"
        },
        "content": {
          "type": "text",
          "analyzer": "ik_max_word",
          "term_vector" : "with_positions_offsets"
        }
      }
    }
  }
}

(2) Forcing a specific highlighter. For example, on a field with term vectors enabled you can still force the plain highlighter:

GET /blog_website/blogs/_search 
{
  "query": {
    "match": {
      "content": "博客"
    }
  },
  "highlight": {
    "fields": {
      "content": {
        "type": "plain"
      }
    }
  }
}

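To sum up, choose based on your actual situation. In most cases the plain highlighter is enough and needs no extra settings. If highlighting performance matters a lot, try enabling the postings highlighter. If field values are very large (over 1 MB), use the fast vector highlighter.

(3) Configuring highlight fragments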

GET /blog_website/blogs/_search 
{
    "query" : {
        "match": { "user": "kimchy" }
    },
    "highlight" : {
        "fields" : {
            "content" : {"fragment_size" : 150, "number_of_fragments" : 3, "no_match_size": 150 }
        }
    }
}

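fragment_size: a field value may be, say, 10,000 characters long, and you cannot show all of that on a page. This sets the length, in characters, of each highlighted fragment to display; the default is 100.
number_of_fragments: the highlighted text may break into several fragments; this sets the maximum number of fragments to return.
no_match_size: the amount of text to return from the start of the field when there is no matching fragment to highlight.

3. Turning searches into templates with search templates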
(1) Getting started with search templates

GET /blog_website/blogs/_search/template
{
  "inline" : {
    "query": { 
      "match" : { 
        "{{field}}" : "{{value}}" 
      } 
    }
  },
  "params" : {
      "field" : "title",
      "value" : "博客"
  }
}

After rendering, this is equivalent to:

GET /blog_website/blogs/_search
{
  "query": { 
    "match" : { 
      "title" : "博客" 
    } 
  }
}
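
To make the substitution concrete, here is a small Python sketch of what rendering amounts to for this template. It is only an illustration: the render function is hypothetical, handles nothing but plain {{name}} placeholders, and skips the escaping that a real Mustache engine applies.

```python
import json
import re

# The inline template from the request above, written as a string.
template = '{"query": {"match": {"{{field}}": "{{value}}"}}}'
params = {"field": "title", "value": "博客"}


def render(template, params):
    # Replace every {{name}} placeholder with the matching parameter.
    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(params[m.group(1)]), template)


query = json.loads(render(template, params))
print(query == {"query": {"match": {"title": "博客"}}})  # True
```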

(2) toJson: passing a parameter into the template as JSON

GET /blog_website/blogs/_search/template
{
  "inline": "{\"query\": {\"match\": {{#toJson}}matchCondition{{/toJson}}}}",
  "params": {
    "matchCondition": {
      "title": "博客"
    }
  }
}
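
The effect of toJson can be imitated in a few lines of Python. This is a sketch under assumptions, not the Mustache implementation: render_to_json is a hypothetical helper that only understands the {{#toJson}}name{{/toJson}} form used above.

```python
import json
import re

# The inline template string from the request above.
template = '{"query": {"match": {{#toJson}}matchCondition{{/toJson}}}}'
params = {"matchCondition": {"title": "博客"}}


def render_to_json(template, params):
    # Replace {{#toJson}}name{{/toJson}} with the JSON form of params[name].
    pattern = r"\{\{#toJson\}\}(\w+)\{\{/toJson\}\}"
    return re.sub(
        pattern,
        lambda m: json.dumps(params[m.group(1)], ensure_ascii=False),
        template,
    )


rendered = render_to_json(template, params)
print(rendered)
print(json.loads(rendered) == {"query": {"match": {"title": "博客"}}})  # True
```

The whole matchCondition object is dropped into the query as JSON, which is why the template has to be passed as a string rather than as a JSON object.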

(3) join: joining multiple values with a delimiter

GET /blog_website/blogs/_search/template
{
  "inline": {
    "query": {
      "match": {
        "title": "{{#join delimiter=' '}}titles{{/join delimiter=' '}}"
      }
    }
  },
  "params": {
    "titles": ["博客", "网站"]
  }
}

After rendering, this is equivalent to:

GET /blog_website/blogs/_search
{
  "query": { 
    "match" : { 
      "title" : "博客 网站" 
    } 
  }
}

(4) Default values

GET /blog_website/blogs/_search/template
{
  "inline": {
    "query": {
      "range": {
        "views": {
          "gte": "{{start}}",
          "lte": "{{end}}{{^end}}20{{/end}}"
        }
      }
    }
  },
  "params": {
    "start": 1,
    "end": 10
  }
}
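
In the template above, {{end}} prints the parameter when it is supplied, and the inverted section {{^end}}20{{/end}} prints 20 when it is not. A rough Python equivalent, with render_range as a hypothetical helper that only covers the "parameter missing" case:

```python
def render_range(params):
    start = params["start"]
    # {{end}} prints the value when it is given; {{^end}}20{{/end}} prints 20
    # only when "end" is absent.
    end = params["end"] if "end" in params else 20
    # The template wraps both values in quotes, so they render as strings.
    return {"query": {"range": {"views": {"gte": str(start), "lte": str(end)}}}}


print(render_range({"start": 1, "end": 10})["query"]["range"]["views"])
print(render_range({"start": 1})["query"]["range"]["views"])
```

The first call keeps the supplied upper bound of 10; the second falls back to 20.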

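(5) conditional (conditional execution)
Save this more complex template in advance under the config/scripts directory of Elasticsearch, with the .mustache extension and the file name conditional. Its content is: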

{
  "query": {
    "bool": {
      "must": {
        "match": {
          "line": "{{text}}" 
        }
      },
      "filter": {
        {{#line_no}} 
          "range": {
            "line_no": {
              {{#start}} 
                "gte": "{{start}}" 
                {{#end}},{{/end}} 
              {{/start}} 
              {{#end}} 
                "lte": "{{end}}" 
              {{/end}} 
            }
          }
        {{/line_no}} 
      }
    }
  }
}

GET /my_index/my_type/_search/template
{
  "file": "conditional",
  "params": {
    "text": "博客",
    "line_no": true,
    "start": 1,
    "end": 10
  }
}
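
The sections in this template are easier to follow when written as ordinary branching. The Python sketch below builds the query that the template is meant to render; build_query is a hypothetical helper and only mirrors the conditions, it does not parse Mustache.

```python
def build_query(text, line_no=False, start=None, end=None):
    query = {
        "query": {
            "bool": {
                "must": {"match": {"line": text}},
                "filter": {},
            }
        }
    }
    # {{#line_no}} ... {{/line_no}}: the range filter exists only if line_no is set.
    if line_no:
        line_range = {}
        # {{#start}} ... {{/start}}
        if start is not None:
            line_range["gte"] = str(start)
        # {{#end}} ... {{/end}}
        if end is not None:
            line_range["lte"] = str(end)
        query["query"]["bool"]["filter"]["range"] = {"line_no": line_range}
    return query


print(build_query("博客", line_no=True, start=1, end=10)["query"]["bool"]["filter"])
print(build_query("博客")["query"]["bool"]["filter"])
```

The {{#end}},{{/end}} inside the start section has no counterpart here: it exists only to emit the comma between "gte" and "lte" when both are present.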

5. Search suggestions with the completion suggester

PUT /news_website
{
  "mappings": {
    "news" : {
      "properties" : {
        "title" : {
          "type": "text",
          "analyzer": "ik_max_word",
          "fields": {
            "suggest" : {
              "type" : "completion",
              "analyzer": "ik_max_word"
            }
          }
        },
        "content": {
          "type": "text",
          "analyzer": "ik_max_word"
        }
      }
    }
  }
}

GET /news_website/news/_search
{
  "suggest": {
    "my-suggest" : {
      "prefix" : "大话西游",
      "completion" : {
        "field" : "title.suggest"
      }
    }
  }
}

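6. Customizing your mapping strategy with dynamic mapping templates
(1) Matching templates by data type
There are two ways for a dynamic template to match:
The first matches on the data type that dynamic mapping detects for a newly added field.
The second matches on the name of the newly added field, either against a fixed name or against a wildcard pattern.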

PUT my_index
{
  "mappings": {
    "my_type": {
      "dynamic_templates": [
        {
          "integers": {
            "match_mapping_type": "long", // if the field is detected as long, map it to integer
            "mapping": {
              "type": "integer"
            }
          }
        },
        {
          "strings": {
            "match_mapping_type": "string", // if the field is detected as string, map it to text with a raw keyword sub-field
            "mapping": {
              "type": "text",
              "fields": {
                "raw": {
                  "type": "keyword",
                  "ignore_above": 500
                }
              }
            }
          }
        }
      ]
    }
  }
}

(2) Matching templates by field name

PUT /my_index 
{
  "mappings": {
    "my_type": {
      "dynamic_templates": [
        {
          "string_as_integer": {
            "match_mapping_type": "string",
            "match": "long_*", // field name starts with long_
            "unmatch": "*_text", // field name does not end with _text
            "mapping": {
              "type": "integer"
            }
          }
        }
      ]
    }
  }
}

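For example, given a field whose value is the string "10", fields like this get mapped to a numeric type (integer in the template above).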

PUT /my_index/my_type/1
{
  "long_field": "20",
  "long_field_text": "20"
}

Checking the resulting mapping shows that long_field is of type integer, while long_field_text is unaffected by the template.
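
Why one field matches and the other does not can be spelled out with a short Python sketch. It assumes that match and unmatch are simple * wildcards and uses fnmatchcase as a stand-in for the matching done by Elasticsearch; template_applies is a hypothetical helper.

```python
from fnmatch import fnmatchcase

# The relevant parts of the string_as_integer template above.
template = {"match_mapping_type": "string", "match": "long_*", "unmatch": "*_text"}


def template_applies(field_name, detected_type, template):
    # The detected JSON type must equal match_mapping_type ...
    if detected_type != template["match_mapping_type"]:
        return False
    # ... the name must fit the match pattern ...
    if not fnmatchcase(field_name, template["match"]):
        return False
    # ... and must not fit the unmatch pattern.
    return not fnmatchcase(field_name, template["unmatch"])


print(template_applies("long_field", "string", template))       # True
print(template_applies("long_field_text", "string", template))  # False
```

long_field_text starts with long_ but also ends with _text, so unmatch excludes it.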

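7. The geo_point data type
(1) Creating a geo_point mapping

The first geo data type is geo_point. Put simply, it is a geographic coordinate made up of a longitude and a latitude; together they uniquely identify a point on the earth.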

PUT /my_index 
{
  "mappings": {
    "my_type": {
      "properties": {
        "location": {
          "type": "geo_point"
        }
      }
    }
  }
}

(2) Three ways to write a geo_point

PUT my_index/my_type/1
{
  "text": "Geo-point as an object",
  "location": { 
    "lat": 41.12,
    "lon": -71.34
  }
}

PUT my_index/my_type/2
{
  "text": "Geo-point as a string",
  "location": "41.12,-71.34" 
}

PUT my_index/my_type/4
{
  "text": "Geo-point as an array",
  "location": [ -71.34, 41.12 ] 
}

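lat: latitude
lon: longitude
The latter two syntaxes (string and array) are not recommended: the string is ordered "lat,lon" while the array is ordered [lon, lat], which is easy to mix up.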
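
The ordering difference between the three forms is easy to see in code. Below is a minimal Python sketch that normalizes each form to a (lat, lon) tuple; parse_geo_point is a hypothetical helper and does not cover other formats such as geohashes.

```python
def parse_geo_point(value):
    # Object form: {"lat": ..., "lon": ...}
    if isinstance(value, dict):
        return float(value["lat"]), float(value["lon"])
    # String form: "lat,lon"
    if isinstance(value, str):
        lat, lon = value.split(",")
        return float(lat), float(lon)
    # Array form: [lon, lat], note the reversed order
    lon, lat = value
    return float(lat), float(lon)


print(parse_geo_point({"lat": 41.12, "lon": -71.34}))  # (41.12, -71.34)
print(parse_geo_point("41.12,-71.34"))                 # (41.12, -71.34)
print(parse_geo_point([-71.34, 41.12]))                # (41.12, -71.34)
```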

(3) Querying by location
The simplest case is finding points by location. For example, the geo_bounding_box query below finds points that fall inside a rectangular area.

GET /my_index/my_type/_search 
{
  "query": {
    "geo_bounding_box": {
      "location": {
        "top_left": {
          "lat": 42,
          "lon": -72
        },
        "bottom_right": {
          "lat": 40,
          "lon": -74
        }
      }
    }
  }
}

It can also be written as a filter inside a bool query:

GET /my_index/my_type/_search
{
  "query": {
    "bool": {
      "must": [
        {"match_all": {}}
      ],
      "filter": {
        "geo_bounding_box": {
          "location": {
            "top_left": {
              "lat": 42,
              "lon": -72.1
            },
            "bottom_right": {
              "lat": 40.717,
              "lon": -74.99
            }
          }
        }
      }
    }
  }
}

(4) geo_polygon: searching inside a polygon defined by several points

GET /my_index/my_type/_search 
{
  "query": {
    "bool": {
      "must": [
        {
          "match_all": {}
        }
      ],
      "filter": {
        "geo_polygon": {
          "location": {
            "points": [
              {"lat" : 42, "lon" : -72},
              {"lat" : 40, "lon" : -74},
              {"lat" : 50, "lon" : 90}
            ]
          }
        }
      }
    }
  }
}
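
Conceptually, this filter keeps the documents whose location falls inside the polygon. The Python sketch below shows the classic ray-casting test on a flat lat/lon plane. This is a simplifying assumption for illustration, not the algorithm Elasticsearch uses, and point_in_polygon is a hypothetical helper.

```python
def point_in_polygon(lat, lon, polygon):
    """Ray casting on a flat plane; polygon is a list of (lat, lon) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        lat1, lon1 = polygon[i]
        lat2, lon2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (lat1 > lat) != (lat2 > lat):
            # Longitude at which the edge crosses that latitude.
            cross_lon = lon1 + (lat - lat1) * (lon2 - lon1) / (lat2 - lat1)
            # Count the crossing if it lies east of the point.
            if lon < cross_lon:
                inside = not inside
    return inside


# The three points from the query above.
polygon = [(42, -72), (40, -74), (50, 90)]
print(point_in_polygon(41.12, -71.34, polygon))  # True
print(point_in_polygon(0.0, 0.0, polygon))       # False
```

The first point is the location stored in the sample documents earlier; the second lies south of every vertex.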

(5) geo_distance: finding points within a given distance of a location

GET /my_index/my_type/_search 
{
  "query": {
    "bool": {
      "must": [
        {
          "match_all": {}
        }
      ],
      "filter": {
        "geo_distance": {
          "distance": "200km",
          "location": {
            "lat": 40,
            "lon": -70
          }
        }
      }
    }
  }
}
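
To get a feel for what "within 200km" means, the distance between two coordinates can be estimated with the haversine formula. The Python sketch below assumes a spherical earth with a mean radius of 6371 km; it is an approximation for illustration, and the distance Elasticsearch computes may differ slightly.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two points given in degrees.
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


origin = (40.0, -70.0)   # the point used in the query above
point = (41.12, -71.34)  # the location stored in the sample documents
distance = haversine_km(*origin, *point)
print(distance, distance <= 200)
```

With these inputs the distance comes out at roughly 168 km, so the sample documents fall inside the 200 km radius.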

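(6) Counting hotels by distance from my current location, for example how many lie within 0 to 100, 100 to 300, and more than 300 distance units (the request below sets "unit": "mi", so its ranges are in miles)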

GET /my_index/my_type/_search 
{
  "size": 0,
  "aggs": {
    "agg_by_distance_range": {
      "geo_distance": {
        "field": "location",
        "origin": {
          "lat": 40,
          "lon": -70
        },
        "unit": "mi", 
        "ranges": [
          {
            "to": 100
          },
          {
            "from": 100,
            "to": 300
          },
          {
            "from": 300
          }
        ]
      }
    }
  }
}

unit: defaults to m (meters); it also accepts mi (miles) and km (kilometers), among others.

distance_type: sloppy_arc (the default), arc (most accurate) and plane (fastest).
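
The bucketing itself is simple range counting. Here is a Python sketch with made-up distances; it assumes the usual range-aggregation convention that from is inclusive and to is exclusive, and bucket_by_distance is a hypothetical helper.

```python
def bucket_by_distance(distances, ranges):
    # ranges is a list of (from, to) pairs; None means "open ended".
    counts = [0] * len(ranges)
    for d in distances:
        for i, (lo, hi) in enumerate(ranges):
            if (lo is None or d >= lo) and (hi is None or d < hi):
                counts[i] += 1
    return counts


# Same ranges as the aggregation above, in miles.
ranges = [(None, 100), (100, 300), (300, None)]
# Hypothetical hotel distances in miles, for illustration only.
hotel_distances_mi = [12.5, 99.9, 100.0, 250.0, 480.0]
print(bucket_by_distance(hotel_distances_mi, ranges))  # [2, 2, 1]
```

A hotel exactly 100 miles away lands in the second bucket, not the first.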
