Elasticsearch 07: ignore_above

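ignore_above sets a length limit on a keyword field: a value longer than the limit is not added to the inverted index. When dynamic mapping creates a keyword sub-field for a string, it sets ignore_above to 256.
In other words, once a field value is longer than the limit, aggregations and searches on that field cannot find the document. The document itself is still stored, so the value remains in _source.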
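The rule above can be sketched as a plain length comparison. This is only an illustration of the behaviour described here, not Elasticsearch's own code; the helper name `is_indexed` is hypothetical, and the sample strings are the ones used in the test script further down.

```python
def is_indexed(value: str, ignore_above: int) -> bool:
    """Return True if a keyword value of this length would still be indexed.

    Assumption: the length is counted in characters. Python's len() counts
    code points, which is the same thing for the samples used here.
    """
    return len(value) <= ignore_above


samples = [
    "Syntax error",                            # 12 characters
    "Syntax error elastic",                    # 20 characters
    "Syntax error with some long stacktrace",  # 38 characters
]

for text in samples:
    # Print the length and whether the value passes the limit of 20.
    print(len(text), is_indexed(text, ignore_above=20))
```

A value whose length equals the limit is still indexed; only strictly longer values are skipped.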

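The largest useful value of ignore_above is 32766, but the right setting depends on the data: for Chinese text, for example, the ceiling should be 10922 (32766 / 3), and 8191 (32766 / 4) if the text may contain 4-byte UTF-8 characters.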

What actually lies behind ignore_above is Lucene's limit on the length of a single term, 32766 bytes: a value longer than that cannot be indexed as one term.
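To see why the byte limit matters, the check below compares a string's character count with its UTF-8 byte count. It is a minimal sketch: the constant and the function names are made up for this example, and the limit of 32766 is the figure quoted above.

```python
LUCENE_MAX_TERM_BYTES = 32766  # byte limit quoted in the text above


def utf8_len(value: str) -> int:
    """Number of bytes the value occupies when encoded as UTF-8."""
    return len(value.encode("utf-8"))


def fits_in_one_term(value: str) -> bool:
    """True if the UTF-8 encoding of the value stays within the byte limit."""
    return utf8_len(value) <= LUCENE_MAX_TERM_BYTES


ascii_text = "a" * 32766  # 1 byte per character
cjk_text = "中" * 10923   # 3 bytes per character

# Each line prints: character count, byte count, fits within the limit.
print(len(ascii_text), utf8_len(ascii_text), fits_in_one_term(ascii_text))
print(len(cjk_text), utf8_len(cjk_text), fits_in_one_term(cjk_text))
```

The Chinese string has far fewer characters than the ASCII one, yet it is the one that exceeds the byte limit.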

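Elasticsearch counts ignore_above in characters, whereas Lucene counts bytes. In UTF-8, each ideographic (CJK) character takes 3 bytes, each character of a script such as Cyrillic takes 2 bytes, and each ASCII character takes 1 byte.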

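Ideographic characters (Chinese, Korean, Japanese): 10922 characters (32766 / 3).
2-byte characters (for example Russian; Hindi's Devanagari script takes 3 bytes per character, so it belongs with the previous group): 16383 characters (32766 / 2).
ASCII characters (a-zA-Z0-9 and special characters such as ~!@#$): 32766 characters (32766 / 1).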
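The figures in this list can be reproduced by dividing the byte limit by the UTF-8 width of one sample character. This is a small sketch; `max_chars` is a hypothetical helper, and the emoji row is an extra case not covered in the list, added to show 4-byte characters.

```python
LUCENE_MAX_TERM_BYTES = 32766  # byte limit quoted in the text above


def max_chars(sample_char: str) -> int:
    """How many copies of sample_char fit into the byte limit."""
    width = len(sample_char.encode("utf-8"))
    return LUCENE_MAX_TERM_BYTES // width


for label, ch in [("ASCII", "a"), ("Cyrillic", "я"), ("CJK", "中"), ("emoji", "😀")]:
    # Print the script label, the UTF-8 width in bytes, and the character budget.
    print(label, len(ch.encode("utf-8")), max_chars(ch))
```

For mixed text, the widest character that can occur decides the safe value, which is why a conservative setting divides by 4.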

Test code

from elasticsearch import Elasticsearch
from pprint import pprint
import time

es = Elasticsearch(hosts=["192.168.1.100"])

# Create the index and write the test data.
if not es.indices.exists(index="my_index"):
    es.indices.create(
        index="my_index",
        body={
            "mappings": {
                "my_type": {
                    "properties": {
                        "message": {
                            "type": "keyword",
                            "ignore_above": 20
                        }
                    }
                }
            }
        }
    )

    # message: shorter than 20 characters (12): indexed, so it can be aggregated and searched.
    s1 = es.index(
        index="my_index",
        doc_type="my_type",
        id=1,
        body={
            "message": "Syntax error"
        }
    )

    # message: exactly 19 characters: indexed, so it can be aggregated and searched.
    s2 = es.index(
        index="my_index",
        doc_type="my_type",
        id=3,
        body={
            "message": "Syntax error search"
        }
    )

    # message: exactly 20 characters (equal to ignore_above): indexed, so it can be aggregated and searched.
    s2 = es.index(
        index="my_index",
        doc_type="my_type",
        id=4,
        body={
            "message": "Syntax error elastic"
        }
    )

    # message: longer than 20 characters: not indexed, so it cannot be aggregated or searched, but an empty search still returns it.
    s3 = es.index(
        index="my_index",
        doc_type="my_type",
        id=2,
        body={
            "message": "Syntax error with some long stacktrace"
        }
    )

    # message: ideographic text, 20 characters: indexed, so it can be aggregated and searched.
    es.index(
        index="my_index",
        doc_type="my_type",
        id=10,
        body={
            "message": "中国象形文字博大精深喔不知道你是否感兴趣"
        }
    )

    # message: ideographic text, 21 characters: not indexed, so it cannot be aggregated or searched, but an empty search still returns it.
    es.index(
        index="my_index",
        doc_type="my_type",
        id=11,
        body={
            "message": "中国象形文字博大精深喔不知道你是否感兴趣呢"
        }
    )

    # Force a refresh so the new documents become searchable right away.
    es.indices.refresh(index="my_index")


# ignore_above: values longer than the limit (20 in this example) do not appear in the aggregation.
s4 = es.search(
    index="my_index",
    body={
        "aggs": {
            "messages": {
                "terms": {
                    "field": "message"
                }
            }
        }
    }
)

# ignore_above: values longer than the limit (20 in this example) are not found by the search.
s5 = es.search(
    index="my_index",
    body={
        "query": {
            "query_string": {
                "query": "syntax"
            }
        }
    }
)

# ignore_above: an empty search (no query) still returns those documents.
s6 = es.search(
    index="my_index"
)

print("Aggregation:")
pprint(s4)

print("\n" * 10, "Full-text search: ")
pprint(s5)

print("\n" * 10, "Empty search: ")
pprint(s6)
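The full responses printed above are long, so comparing the three cases is easier with only the document ids and the bucket keys. The two helpers below are a sketch with hypothetical names; they rely only on the usual response layout (`hits.hits[]._id` and `aggregations.<name>.buckets[].key`). The `example` dictionary is a hand-written stand-in used to show the shape, not real output from the script.

```python
def hit_ids(response: dict) -> list:
    """Return the _id of every hit in a search response."""
    return [hit["_id"] for hit in response["hits"]["hits"]]


def bucket_keys(response: dict, agg_name: str = "messages") -> list:
    """Return the bucket keys of a terms aggregation in a search response."""
    return [bucket["key"] for bucket in response["aggregations"][agg_name]["buckets"]]


# Hand-written stand-in with the same layout as a real response (not real output).
example = {
    "hits": {"hits": [{"_id": "1"}, {"_id": "3"}]},
    "aggregations": {"messages": {"buckets": [{"key": "Syntax error", "doc_count": 1}]}},
}

print(hit_ids(example))
print(bucket_keys(example))
```

With the script above, `bucket_keys(s4)`, `hit_ids(s5)` and `hit_ids(s6)` give the compact view of the aggregation, the search and the empty search.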

References

[x] ignore_above
