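1、Using term vectors to inspect your data in depth
(1)、Introduction to term vectors
The term vectors API returns statistics about each term in a given field of a document
(2)、Index-time term vector experiment
Term vectors carry a number of term-level and field-level statistics. There are two ways to collect them:
index-time: configure term_vector in the mapping, and the term and field statistics are generated and stored when the document is indexed
query-time: nothing is stored in advance; when you request the term vectors, the statistics are computed on the fly and returned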
PUT /my_index
{
"mappings": {
"my_type": {
"properties": {
"text": {
"type": "text",
"term_vector": "with_positions_offsets_payloads",
"store" : true,
"analyzer" : "fulltext_analyzer"
},
"fullname": {
"type": "text",
"analyzer" : "fulltext_analyzer"
}
}
}
},
"settings" : {
"index" : {
"number_of_shards" : 1,
"number_of_replicas" : 0
},
"analysis": {
"analyzer": {
"fulltext_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"type_as_payload"
]
}
}
}
}
}
PUT /my_index/my_type/1
{
"fullname" : "Leo Li",
"text" : "hello test test test "
}
PUT /my_index/my_type/2
{
"fullname" : "Leo Li",
"text" : "other hello test ..."
}
GET /my_index/my_type/1/_termvectors
{
"fields" : ["text"],
"offsets" : true,
"payloads" : true,
"positions" : true,
"term_statistics" : true,
"field_statistics" : true
}
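term statistics: enabled with "term_statistics" : true
Returns:
total term frequency (ttf): how many times the term occurs across all documents
document frequency (doc_freq): how many documents contain the term
field statistics: enabled with "field_statistics" : true
Returns:
document count (doc_count): how many documents contain this field
sum of document frequencies (sum_doc_freq): the sum of doc_freq over all terms in this field
sum of total term frequencies (sum_ttf): the sum of ttf over all terms in this field; like the other statistics, it is computed across the index rather than for the single document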
{
"_index": "my_index",
"_type": "my_type",
"_id": "1",
"_version": 1,
"found": true,
"took": 1,
"term_vectors": {
"text": {
"field_statistics": {
"sum_doc_freq": 6,
"doc_count": 2,
"sum_ttf": 8
},
"terms": {
"hello": {
"doc_freq": 2,
"ttf": 2,
"term_freq": 1,
"tokens": [
{
"position": 0,
"start_offset": 0,
"end_offset": 5,
"payload": "d29yZA=="
}
]
},
"test": {
"doc_freq": 2,
"ttf": 4,
"term_freq": 3,
"tokens": [
{
"position": 1,
"start_offset": 6,
"end_offset": 10,
"payload": "d29yZA=="
},
{
"position": 2,
"start_offset": 11,
"end_offset": 15,
"payload": "d29yZA=="
},
{
"position": 3,
"start_offset": 16,
"end_offset": 20,
"payload": "d29yZA=="
}
]
}
}
}
}
}
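The statistics in the response above can be reproduced by hand from the two sample documents. The following is a minimal sketch, assuming the analyzer behaves like a plain whitespace split followed by lowercasing; the helper name analyze is made up for this illustration and is not part of Elasticsearch. The last line decodes the Base64 payload that type_as_payload attached to each token.

```python
import base64
from collections import Counter

# The "text" field of the two sample documents indexed above.
docs = {
    "1": "hello test test test ",
    "2": "other hello test ...",
}

def analyze(text):
    """Hypothetical stand-in for fulltext_analyzer: whitespace split + lowercase."""
    return [token.lower() for token in text.split()]

ttf = Counter()       # total term frequency across all documents
doc_freq = Counter()  # number of documents that contain each term
for text in docs.values():
    tokens = analyze(text)
    ttf.update(tokens)
    doc_freq.update(set(tokens))

field_statistics = {
    "doc_count": sum(1 for text in docs.values() if analyze(text)),
    "sum_doc_freq": sum(doc_freq.values()),
    "sum_ttf": sum(ttf.values()),
}
# term_freq is counted inside one document only (document 1 here).
term_freq_doc1 = Counter(analyze(docs["1"]))

print(field_statistics)  # {'doc_count': 2, 'sum_doc_freq': 6, 'sum_ttf': 8}
print(ttf["test"], doc_freq["test"], term_freq_doc1["test"])  # 4 2 3
print(base64.b64decode("d29yZA==").decode("utf-8"))  # word
```

The token "..." counts as a term here because a whitespace tokenizer does not strip punctuation, which is why sum_ttf is 8 rather than 7.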
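(3)、Query-time term vector experiment
The fullname field has no term_vector setting in the mapping, so its term vectors are computed on the fly at request time
GET /my_index/my_type/1/_termvectors
{
"fields": ["fullname"],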
"offsets": true,
"positions": true,
"term_statistics": true,
"field_statistics": true
}
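If the extra computation per request is acceptable, query-time term vectors are usually enough, and they avoid storing additional data in the index
(4)、Term vectors for a manually supplied document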
GET /my_index/my_type/_termvectors
{
"doc" : {
"fullname" : "Leo Li",
"text" : "hello test test test"
},
"fields" : ["text"],
"offsets" : true,
"payloads" : true,
"positions" : true,
"term_statistics" : true,
"field_statistics" : true
}
(5)、Specifying the analyzer manually when generating term vectors
GET /my_index/my_type/_termvectors
{
"doc" : {
"fullname" : "Leo Li",
"text" : "hello test test test"
},
"fields" : ["text"],
"offsets" : true,
"payloads" : true,
"positions" : true,
"term_statistics" : true,
"field_statistics" : true,
"per_field_analyzer" : {
"text": "standard"
}
}
(6)、Terms filter
Filters the returned terms by their statistics, so that only the term vector entries you care about come back
GET /my_index/my_type/_termvectors
{
"doc" : {
"fullname" : "Leo Li",
"text" : "hello test test test"
},
"fields" : ["text"],
"offsets" : true,
"payloads" : true,
"positions" : true,
"term_statistics" : true,
"field_statistics" : true,
"filter" : {
"max_num_terms" : 3,
"min_term_freq" : 1,
"min_doc_freq" : 1
}
}
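The effect of the three filter parameters can be sketched in a few lines. This is a rough illustration only: the function filter_terms and the tf-idf-style score it ranks by are assumptions made for this example, not the exact scoring Elasticsearch applies.

```python
import math

def filter_terms(terms, total_docs, max_num_terms=25, min_term_freq=1, min_doc_freq=1):
    """Hypothetical helper. terms maps term -> {"term_freq": int, "doc_freq": int}."""
    # Drop terms that are too rare in the document or in the index.
    kept = {
        term: stats
        for term, stats in terms.items()
        if stats["term_freq"] >= min_term_freq and stats["doc_freq"] >= min_doc_freq
    }

    def score(item):
        # Simplified tf-idf: frequent in this document, rare in the index.
        stats = item[1]
        return stats["term_freq"] * math.log(1 + total_docs / stats["doc_freq"])

    ranked = sorted(kept.items(), key=score, reverse=True)
    # Keep only the best-scoring terms, at most max_num_terms of them.
    return [term for term, _ in ranked[:max_num_terms]]

sample_terms = {
    "hello": {"term_freq": 1, "doc_freq": 2},
    "test": {"term_freq": 3, "doc_freq": 2},
}
print(filter_terms(sample_terms, total_docs=2, max_num_terms=1))   # ['test']
print(filter_terms(sample_terms, total_docs=2, min_term_freq=2))   # ['test']
```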
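2、Highlighting
(1)、The three highlighters
plain highlighter: the standard Lucene highlighter, and the default
postings highlighter: used when the field is mapped with index_options=offsets
Faster than the plain highlighter, because the text being highlighted does not have to be re-analyzed
Uses less disk space than the term vectors that the fast vector highlighter needs
Splits the text into sentences and highlights whole sentences, which tends to produce better snippets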
PUT /blog_website
{
"mappings": {
"blogs": {
"properties": {
"title": {
"type": "text",
"analyzer": "ik_max_word"
},
"content": {
"type": "text",
"analyzer": "ik_max_word",
"index_options": "offsets"
}
}
}
}
}
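fast vector highlighter: used when the field is mapped with term_vector=with_positions_offsets
Performs better on large fields (larger than 1 MB)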
PUT /blog_website
{
"mappings": {
"blogs": {
"properties": {
"title": {
"type": "text",
"analyzer": "ik_max_word"
},
"content": {
"type": "text",
"analyzer": "ik_max_word",
"term_vector" : "with_positions_offsets"
}
}
}
}
}
(2)、Forcing a particular highlighter. For example, on a field that has term vectors enabled you can still force the plain highlighter
GET /blog_website/blogs/_search
{
"query": {
"match": {
"content": "博客"
}
},
"highlight": {
"fields": {
"content": {
"type": "plain"
}
}
}
}
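To summarize, choose according to your situation. In most cases the plain highlighter is enough and needs no extra settings. If highlighting performance matters a lot, try the postings highlighter. If field values are very large (over 1 MB), use the fast vector highlighter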
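The rule of thumb above can be written down as a small decision function. This is only a sketch of the guidance in these notes: choose_highlighter and its parameters are hypothetical names, and the returned strings are the values the highlight "type" option takes in the Elasticsearch 5.x series these notes target.

```python
ONE_MB = 1024 * 1024

def choose_highlighter(field_mapping, field_size_bytes=0, need_speed=False):
    """Hypothetical helper that mirrors the summary: plain unless there is a reason not to."""
    has_term_vectors = field_mapping.get("term_vector") == "with_positions_offsets"
    has_offsets = field_mapping.get("index_options") == "offsets"
    if field_size_bytes > ONE_MB and has_term_vectors:
        return "fvh"        # fast vector highlighter for very large values
    if need_speed and has_offsets:
        return "postings"   # postings highlighter when performance matters
    return "plain"          # the default, no extra mapping needed

print(choose_highlighter({"type": "text"}))  # plain
print(choose_highlighter({"type": "text", "index_options": "offsets"}, need_speed=True))  # postings
print(choose_highlighter({"type": "text", "term_vector": "with_positions_offsets"}, 2 * ONE_MB))  # fvh
```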
(3)、Configuring highlight fragments
GET /blog_website/blogs/_search
{
"query" : {
"match": { "content": "博客" }
},
"highlight" : {
"fields" : {
"content" : {"fragment_size" : 150, "number_of_fragments" : 3, "no_match_size": 150 }
}
}
}
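fragment_size: a field value may be, say, 10,000 characters long, which is too much to show on a page; fragment_size sets the length in characters of each highlighted fragment, and defaults to 100
number_of_fragments: a field may produce several highlighted fragments; this sets the maximum number returned
no_match_size: when nothing in the field matches, return this many characters from the start of the field instead of nothing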
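How the fragment settings interact can be illustrated with a toy highlighter. This is a simplified sketch, not the algorithm Elasticsearch uses: the function highlight is hypothetical, it cuts fragments on whitespace only, and it wraps matches in the default <em> tags.

```python
import re

def highlight(text, term, fragment_size=100, number_of_fragments=5, no_match_size=0):
    """Toy highlighter: split on word boundaries, keep fragments that contain the term."""
    fragments, current = [], ""
    for word in text.split():
        candidate = (current + " " + word).strip()
        if len(candidate) > fragment_size and current:
            fragments.append(current)   # current fragment is full, start a new one
            current = word
        else:
            current = candidate
    if current:
        fragments.append(current)

    pattern = re.compile(re.escape(term), re.IGNORECASE)
    matched = [
        pattern.sub(lambda m: "<em>" + m.group(0) + "</em>", fragment)
        for fragment in fragments
        if pattern.search(fragment)
    ]
    if matched:
        return matched[:number_of_fragments]
    # Nothing matched: fall back to the beginning of the field, if requested.
    return [text[:no_match_size]] if no_match_size else []

sample = "alpha beta gamma delta beta epsilon"
print(highlight(sample, "beta", fragment_size=12, number_of_fragments=1))
# ['alpha <em>beta</em>']
print(highlight(sample, "zeta", fragment_size=12, no_match_size=5))
# ['alpha']
```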
3、Templating searches with search templates
(1)、Getting started with search templates
GET /blog_website/blogs/_search/template
{
"inline" : {
"query": {
"match" : {
"{{field}}" : "{{value}}"
}
}
},
"params" : {
"field" : "title",
"value" : "博客"
}
}
This is rendered as:
GET /blog_website/blogs/_search
{
"query": {
"match" : {
"title" : "博客"
}
}
}
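The substitution step can be mimicked outside Elasticsearch to see what a template expands to. The sketch below is an assumption-laden stand-in for Mustache: the render function is hypothetical and only handles plain {{name}} placeholders.

```python
import json
import re

def render(template, params):
    """Hypothetical mini-renderer: replace each {{name}} with params[name]."""
    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(params[m.group(1)]), template)

template = '{"query": {"match": {"{{field}}": "{{value}}"}}}'
query = json.loads(render(template, {"field": "title", "value": "博客"}))
print(query)  # {'query': {'match': {'title': '博客'}}}
```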
(2)、toJson: passing a parameter into the template string as a JSON object
GET /blog_website/blogs/_search/template
{
"inline": "{\"query\": {\"match\": {{#toJson}}matchCondition{{/toJson}}}}",
"params": {
"matchCondition": {
"title": "博客"
}
}
}
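What toJson does to the template string can be sketched as follows. The function render_to_json is a hypothetical illustration of the expansion, not the Mustache implementation that Elasticsearch ships.

```python
import json
import re

def render_to_json(template, params):
    """Expand {{#toJson}}name{{/toJson}} into the JSON text of params[name]."""
    pattern = re.compile(r"\{\{#toJson\}\}(\w+)\{\{/toJson\}\}")
    return pattern.sub(
        lambda m: json.dumps(params[m.group(1)], ensure_ascii=False), template
    )

template = '{"query": {"match": {{#toJson}}matchCondition{{/toJson}}}}'
rendered = render_to_json(template, {"matchCondition": {"title": "博客"}})
print(rendered)              # {"query": {"match": {"title": "博客"}}}
print(json.loads(rendered))  # parses as a regular query body
```

Because the placeholder sits where a JSON object is expected, the template itself has to be passed as a string, as in the request above.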
(3)、join: concatenating several values with a delimiter
GET /blog_website/blogs/_search/template
{
"inline": {
"query": {
"match": {
"title": "{{#join delimiter=' '}}titles{{/join delimiter=' '}}"
}
}
},
"params": {
"titles": ["博客", "网站"]
}
}
This is rendered as:
GET /blog_website/blogs/_search
{
"query": {
"match" : {
"title" : "博客 网站"
}
}
}
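The join expansion is ordinary string joining. A small sketch, with render_join as a hypothetical helper that understands only the delimiter form used above:

```python
import re

def render_join(template, params):
    """Expand {{#join delimiter='X'}}name{{/join delimiter='X'}} by joining params[name]."""
    pattern = re.compile(
        r"\{\{#join delimiter='([^']*)'\}\}(\w+)\{\{/join delimiter='[^']*'\}\}"
    )
    return pattern.sub(
        lambda m: m.group(1).join(str(v) for v in params[m.group(2)]), template
    )

template = "{{#join delimiter=' '}}titles{{/join delimiter=' '}}"
print(render_join(template, {"titles": ["博客", "网站"]}))  # 博客 网站
```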
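(4)、Default values
With the params below, end is supplied, so lte renders as 10; if end is left out, the inverted section {{^end}}20{{/end}} supplies 20 instead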
GET /blog_website/blogs/_search/template
{
"inline": {
"query": {
"range": {
"views": {
"gte": "{{start}}",
"lte": "{{end}}{{^end}}20{{/end}}"
}
}
}
},
"params": {
"start": 1,
"end": 10
}
}
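The default-value pattern amounts to "use the parameter if it is present, otherwise fall back". A minimal sketch, where value_or_default and build_range are hypothetical helpers and only a missing or null parameter triggers the fallback:

```python
def value_or_default(params, name, default):
    """Return params[name] when it is present and not None, else the default."""
    value = params.get(name)
    return default if value is None else value

def build_range(params):
    # Mirrors "lte": "{{end}}{{^end}}20{{/end}}" from the template above.
    return {
        "range": {
            "views": {
                "gte": params["start"],
                "lte": value_or_default(params, "end", 20),
            }
        }
    }

print(build_range({"start": 1, "end": 10}))  # lte is 10
print(build_range({"start": 1}))             # lte falls back to 20
```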
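(5)、Conditionals
Save this more complex template in advance under the config/scripts directory of the Elasticsearch node, with the .mustache extension. The file is named conditional.mustache and contains: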
{
"query": {
"bool": {
"must": {
"match": {
"line": "{{text}}"
}
},
"filter": {
{{#line_no}}
"range": {
"line_no": {
{{#start}}
"gte": "{{start}}"
{{#end}},{{/end}}
{{/start}}
{{#end}}
"lte": "{{end}}"
{{/end}}
}
}
{{/line_no}}
}
}
}
}
GET /my_index/my_type/_search/template
{
"file": "conditional",
"params": {
"text": "博客",
"line_no": true,
"start": 1,
"end": 10
}
}
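The logic encoded by the {{#line_no}}, {{#start}} and {{#end}} sections is easier to follow as ordinary branching. The sketch below builds the same query shape in Python; build_query is a hypothetical function written for this illustration.

```python
def build_query(text, line_no=False, start=None, end=None):
    """Build the query that the conditional template renders for the given params."""
    bool_query = {"must": {"match": {"line": text}}, "filter": {}}
    if line_no:                       # {{#line_no}} ... {{/line_no}}
        line_range = {}
        if start is not None:         # {{#start}} ... {{/start}}
            line_range["gte"] = str(start)
        if end is not None:           # {{#end}} ... {{/end}}
            line_range["lte"] = str(end)
        bool_query["filter"]["range"] = {"line_no": line_range}
    return {"query": {"bool": bool_query}}

print(build_query("博客", line_no=True, start=1, end=10))
print(build_query("博客"))  # no range filter at all
```

The {{#end}},{{/end}} fragment inside the start section exists only to emit the comma between "gte" and "lte" when both are present, which a dictionary handles automatically.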
5、Search suggestions with the completion suggester
PUT /news_website
{
"mappings": {
"news" : {
"properties" : {
"title" : {
"type": "text",
"analyzer": "ik_max_word",
"fields": {
"suggest" : {
"type" : "completion",
"analyzer": "ik_max_word"
}
}
},
"content": {
"type": "text",
"analyzer": "ik_max_word"
}
}
}
}
}
GET /news_website/news/_search
{
"suggest": {
"my-suggest" : {
"prefix" : "大话西游",
"completion" : {
"field" : "title.suggest"
}
}
}
}
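Conceptually the completion suggester answers "which stored inputs start with this prefix". The sketch below shows that idea with a plain list scan; the function suggest and the sample titles are made up for illustration, and the real suggester relies on a dedicated in-memory structure rather than a scan like this.

```python
def suggest(prefix, inputs, size=5):
    """Return up to `size` stored inputs that start with the prefix, in sorted order."""
    matches = sorted(s for s in inputs if s.startswith(prefix))
    return matches[:size]

# Hypothetical titles, not taken from any real index.
titles = ["大话西游电影", "大话西游小说", "西游记"]
print(suggest("大话西游", titles))  # ['大话西游小说', '大话西游电影']
```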
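6、Customizing mappings with dynamic templates
(1)、Matching templates by detected type
A dynamic template can be matched in two ways:
First, by the data type that dynamic mapping detects for a newly added field
Second, by the name of the newly added field, compared against a predefined name or wildcard pattern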
PUT my_index
{
"mappings": {
"my_type": {
"dynamic_templates": [
{
"integers": {
"match_mapping_type": "long",//fields detected as long are mapped as integer
"mapping": {
"type": "integer"
}
}
},
{
"strings": {
"match_mapping_type": "string",//fields detected as string are mapped as text with a keyword sub-field named raw
"mapping": {
"type": "text",
"fields": {
"raw": {
"type": "keyword",
"ignore_above": 500
}
}
}
}
}
]
}
}
}
(2)、Matching templates by field name
PUT /my_index
{
"mappings": {
"my_type": {
"dynamic_templates": [
{
"string_as_integer": {
"match_mapping_type": "string",
"match": "long_*",//field name starts with long_
"unmatch": "*_text",//field name does not end with _text
"mapping": {
"type": "integer"
}
}
}
]
}
}
}
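For example, a field such as "long_field": "20", whose value is a numeric string, is mapped as a number (integer in this template) instead of text. The document has to be indexed into the index that defines the template:
PUT /my_index/my_type/1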
{
"long_field": "20",
"long_field_text": "20"
}
Checking the resulting mapping shows that long_field is mapped as integer, while long_field_text is unaffected by the template
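Why one field is picked up and the other is not follows from the match and unmatch patterns. The sketch below approximates that check with shell-style wildcards from the Python standard library; template_applies is a hypothetical helper, and fnmatchcase supports more pattern syntax than the simple * wildcard used here.

```python
from fnmatch import fnmatchcase

def template_applies(field_name, detected_type, template):
    """Return True when a dynamic template would apply to the field."""
    if template.get("match_mapping_type") not in (None, detected_type):
        return False
    if "match" in template and not fnmatchcase(field_name, template["match"]):
        return False
    if "unmatch" in template and fnmatchcase(field_name, template["unmatch"]):
        return False
    return True

string_as_integer = {
    "match_mapping_type": "string",
    "match": "long_*",
    "unmatch": "*_text",
}
print(template_applies("long_field", "string", string_as_integer))       # True
print(template_applies("long_field_text", "string", string_as_integer))  # False
```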
7、The geo_point data type
(1)、Creating a mapping with a geo_point field
The first geo data type is geo_point: a single coordinate made up of a latitude and a longitude, which together identify one point on the Earth
PUT /my_index
{
"mappings": {
"my_type": {
"properties": {
"location": {
"type": "geo_point"
}
}
}
}
}
(2)、Three ways to write a geo_point
PUT my_index/my_type/1
{
"text": "Geo-point as an object",
"location": {
"lat": 41.12,
"lon": -71.34
}
}
PUT my_index/my_type/2
{
"text": "Geo-point as a string",
"location": "41.12,-71.34"
}
PUT my_index/my_type/4
{
"text": "Geo-point as an array",
"location": [ -71.34, 41.12 ]
}
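lat: latitude
lon: longitude
The string form is ordered "lat,lon" while the array form is ordered [lon, lat], which is easy to mix up, so the last two syntaxes are not recommended
(3)、Querying by location
The simplest case is finding points inside an area. The geo_bounding_box query below returns the points that fall inside a rectangle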
GET /my_index/my_type/_search
{
"query": {
"geo_bounding_box": {
"location": {
"top_left": {
"lat": 42,
"lon": -72
},
"bottom_right": {
"lat": 40,
"lon": -74
}
}
}
}
}
The same condition can also be written as a filter inside a bool query:
GET /my_index/my_type/_search
{
"query": {
"bool": {
"must": [
{"match_all": {}}
],
"filter": {
"geo_bounding_box": {
"location": {
"top_left": {
"lat": 42,
"lon": -72.1
},
"bottom_right": {
"lat": 40.717,
"lon": -74.99
}
}
}
}
}
}
}
(4)、geo_polygon: finding points inside a polygon defined by several vertices
GET /my_index/my_type/_search
{
"query": {
"bool": {
"must": [
{
"match_all": {}
}
],
"filter": {
"geo_polygon": {
"location": {
"points": [
{"lat" : 42, "lon" : -72},
{"lat" : 40, "lon" : -74},
{"lat" : 50, "lon" : 90}
]
}
}
}
}
}
}
(5)、geo_distance: finding points within a given distance of a location
GET /my_index/my_type/_search
{
"query": {
"bool": {
"must": [
{
"match_all": {}
}
],
"filter": {
"geo_distance": {
"distance": "200km",
"location": {
"lat": 40,
"lon": -70
}
}
}
}
}
}
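The distance test behind this filter can be approximated with the haversine formula. The sketch below is an illustration under simple assumptions (a spherical Earth with a 6371 km radius); haversine_km and within_distance are hypothetical helpers, and Elasticsearch's own distance calculation may differ slightly.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an assumption of this sketch

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def within_distance(origin, point, distance_km):
    return haversine_km(origin["lat"], origin["lon"], point["lat"], point["lon"]) <= distance_km

origin = {"lat": 40, "lon": -70}            # the center used in the query above
stored = {"lat": 41.12, "lon": -71.34}      # the sample geo_point indexed earlier
print(round(haversine_km(40, -70, 41.12, -71.34)))  # roughly 168 km
print(within_distance(origin, stored, 200))          # True
```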
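(6)、Counting by distance from the current location, for example how many hotels lie within 0~100m, within 100m~300m, and beyond 300m. The request below sets "unit": "mi", so its ranges are in miles; set "unit": "m" to work in meters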
GET /my_index/my_type/_search
{
"size": 0,
"aggs": {
"agg_by_distance_range": {
"geo_distance": {
"field": "location",
"origin": {
"lat": 40,
"lon": -70
},
"unit": "mi",
"ranges": [
{
"to": 100
},
{
"from": 100,
"to": 300
},
{
"from": 300
}
]
}
}
}
}
unit: defaults to m (meters), and also accepts mi (miles) and km (kilometers)
distance_type: sloppy_arc (the default), arc (most accurate) and plane (fastest)
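The bucketing performed by the aggregation can be sketched once the distances are known. In the illustration below, bucket_by_distance is a hypothetical helper and the distances are made-up sample values; each range includes its from value and excludes its to value.

```python
def bucket_by_distance(distances, ranges):
    """Count how many distances fall into each {"from": ..., "to": ...} range."""
    counts = [0] * len(ranges)
    for distance in distances:
        for i, r in enumerate(ranges):
            lower = r.get("from", 0)
            upper = r.get("to", float("inf"))
            if lower <= distance < upper:
                counts[i] += 1
    return counts

ranges = [{"to": 100}, {"from": 100, "to": 300}, {"from": 300}]
sample_distances = [50, 100, 250, 300, 900]  # hypothetical, in the aggregation's unit
print(bucket_by_distance(sample_distances, ranges))  # [1, 2, 2]
```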