Elasticsearch mapping parameters and automatically mapped fields, and why aggregations do not support the text type

When building the Elasticsearch mapping, I used a Map-typed field:

private Map specs;

Checking the automatically generated mapping in Kibana showed:

"specs": {
            "properties": {
              "CPU品牌": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },

rather than the usual simple types such as:

         "price": {
            "type": "long"
          },
          "skus": {
            "type": "keyword",
            "index": false
          },

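I was unfamiliar with the first form and its fields entry, so I went looking for material and eventually found the answer in the official documentation: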

fields

It is often useful to index the same field in different ways for different purposes. This is the purpose of multi-fields. For instance, a string field could be mapped as a text field for full-text search, and as a keyword field for sorting or aggregations:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "city": {
          "type": "text",
          "fields": {
            "raw": { 
              "type":  "keyword"
            }
          }
        }
      }
    }
  }
}

PUT my_index/_doc/1
{
  "city": "New York"
}

PUT my_index/_doc/2
{
  "city": "York"
}

GET my_index/_search
{
  "query": {
    "match": {
      "city": "york" 
    }
  },
  "sort": {
    "city.raw": "asc" 
  },
  "aggs": {
    "Cities": {
      "terms": {
        "field": "city.raw" 
      }
    }
  }
}

The city.raw field is a keyword version of the city field.

The city field can be used for full text search.

 

The city.raw field can be used for sorting and aggregations

Note

Multi-fields do not change the original _source field.

The fields setting is allowed to have different settings for fields of the same name in the same index. New multi-fields can be added to existing fields using the PUT mapping API.
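The city example above can be sketched in a few lines of Python. This is only a toy model, not how Elasticsearch is implemented: the `analyze` function is a rough stand-in for the standard analyzer, and the in-memory `index` dictionary and the `match` helper are hypothetical names made up for illustration.

```python
import re


def analyze(text):
    """Rough stand-in for the standard analyzer: lowercase, split on non-word characters."""
    return [token for token in re.split(r"\W+", text.lower()) if token]


# Hypothetical in-memory "index": each document keeps both views of the same value.
docs = {1: "New York", 2: "York"}
index = {
    doc_id: {"city": analyze(value), "city.raw": value}
    for doc_id, value in docs.items()
}


def match(field, query):
    """Return ids of documents whose analyzed tokens share a term with the analyzed query."""
    terms = set(analyze(query))
    return [doc_id for doc_id, fields in index.items() if terms & set(fields[field])]


# Full-text search runs against the analyzed tokens, so "york" finds both documents.
hits = match("city", "york")

# Sorting uses the untouched keyword value stored under city.raw.
ordered = sorted(hits, key=lambda doc_id: index[doc_id]["city.raw"])
print(hits, ordered)
```

The point of the sketch is that one source value feeds two independent structures: the token list answers the search, while the raw string is what sorting and aggregations read.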

Multi-fields with multiple analyzers

Another use case of multi-fields is to analyze the same field in different ways for better relevance. For instance we could index a field with the standard analyzer which breaks text up into words, and again with the english analyzer which stems words into their root form:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "text": { 
          "type": "text",
          "fields": {
            "english": { 
              "type":     "text",
              "analyzer": "english"
            }
          }
        }
      }
    }
  }
}

PUT my_index/_doc/1
{ "text": "quick brown fox" } 

PUT my_index/_doc/2
{ "text": "quick brown foxes" } 

GET my_index/_search
{
  "query": {
    "multi_match": {
      "query": "quick brown foxes",
      "fields": [ 
        "text",
        "text.english"
      ],
      "type": "most_fields" 
    }
  }
}

The text field uses the standard analyzer.

The text.english field uses the english analyzer.

 

Index two documents, one with fox and the other with foxes.

 

Query both the text and text.english fields and combine the scores.

The text field contains the term fox in the first document and foxes in the second document. The text.english field contains fox for both documents, because foxes is stemmed to fox.

The query string is also analyzed by the standard analyzer for the text field, and by the english analyzer for the text.english field. The stemmed field allows a query for foxes to also match the document containing just fox. This allows us to match as many documents as possible. By also querying the unstemmed text field, we improve the relevance score of the document which matches foxes exactly.
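The effect of querying both fields can be approximated with a toy model. Everything here is an assumption made for illustration: `toy_english` only strips a plural suffix (the real english analyzer does far more), and the score is a plain count of matching terms summed across fields, whereas Elasticsearch uses a proper relevance model.

```python
import re


def standard(text):
    """Rough stand-in for the standard analyzer."""
    return [token for token in re.split(r"\W+", text.lower()) if token]


def toy_english(text):
    """Very crude stemmer (assumption): strips a trailing 'es' or 's' from longer words."""
    stemmed = []
    for token in standard(text):
        if token.endswith("es") and len(token) > 3:
            token = token[:-2]
        elif token.endswith("s") and len(token) > 3:
            token = token[:-1]
        stemmed.append(token)
    return stemmed


# One analyzer per field, mirroring the text / text.english multi-field.
ANALYZERS = {"text": standard, "text.english": toy_english}
docs = {1: "quick brown fox", 2: "quick brown foxes"}


def score(doc_text, query):
    """Sum the number of matching query terms over both fields (most_fields-like)."""
    total = 0
    for analyzer in ANALYZERS.values():
        doc_terms = set(analyzer(doc_text))
        total += sum(1 for term in analyzer(query) if term in doc_terms)
    return total


scores = {doc_id: score(text, "quick brown foxes") for doc_id, text in docs.items()}
print(scores)
```

Both documents match through the stemmed field, and the document containing foxes exactly ends up with the higher combined score because it also matches on the unstemmed field.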

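Problem solved: fields indexes the same field in different ways for different purposes. For example, a string field can be mapped as a text field for full-text search and also as a keyword field for sorting or aggregations. Here, the text field CPU品牌 can be used for analyzed search, while its keyword sub-field can be used for aggregations. To test this, I first ran a terms aggregation directly on a text field: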

GET /goods/_search
{
  "size": 0,
  "aggs": {
    "NAME": {
      "terms": {
        "field": "specs.内存",
        "size": 10
      }
    }
  }
}

The search returned:

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [specs.内存] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "goods",
        "node": "sjredvFNT729Jrv0wvucVA",
        "reason": {
          "type": "illegal_argument_exception",
          "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [specs.内存] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."
        }
      }
    ]
  },
  "status": 400
}

Why does this error occur?

I found the explanation in the official documentation:

fielddata

Most fields are indexed by default, which makes them searchable. Sorting, aggregations, and accessing field values in scripts, however, requires a different access pattern from search.

Search needs to answer the question "Which documents contain this term?", while sorting and aggregations need to answer a different question: "What is the value of this field for this document?".

Most fields can use index-time, on-disk doc_values for this data access pattern, but text fields do not support doc_values.

Instead, text fields use a query-time in-memory data structure called fielddata. This data structure is built on demand the first time that a field is used for aggregations, sorting, or in a script. It is built by reading the entire inverted index for each segment from disk, inverting the term ↔︎ document relationship, and storing the result in memory, in the JVM heap.
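The "uninverting" step described above can be illustrated with plain dictionaries. This is a conceptual sketch only: the `inverted` data and the `uninvert` function are hypothetical, and the real fielddata structure is a far more compact per-segment representation.

```python
from collections import defaultdict

# Inverted index for one segment: term -> ids of the documents that contain it.
# This is the shape search needs: "which documents contain this term?"
inverted = {"new": [1], "york": [1, 2]}


def uninvert(inverted_index):
    """Flip term -> docs into doc -> terms by reading every posting list."""
    fielddata = defaultdict(list)
    for term in sorted(inverted_index):
        for doc_id in inverted_index[term]:
            fielddata[doc_id].append(term)
    return dict(fielddata)


# This is the shape sorting and aggregations need:
# "what is the value of this field for this document?"
fielddata = uninvert(inverted)
print(fielddata)
```

Note that the function has to walk the whole index before it can answer for even a single document, which hints at why building this structure at query time is costly.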

Fielddata is disabled on text fields by default

Fielddata can consume a lot of heap space, especially when loading high cardinality text fields. Once fielddata has been loaded into the heap, it remains there for the lifetime of the segment. Also, loading fielddata is an expensive process which can cause users to experience latency hits. This is why fielddata is disabled by default.

If you try to sort, aggregate, or access values from a script on a text field, you will see this exception:

Fielddata is disabled on text fields by default. Set fielddata=true on [your_field_name] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory.

Before enabling fielddata

Before you enable fielddata, consider why you are using a text field for aggregations, sorting, or in a script. It usually doesn’t make sense to do so.

A text field is analyzed before indexing so that a value like New York can be found by searching for new or for york. A terms aggregation on this field will return a new bucket and a york bucket, when you probably want a single bucket called New York.

Instead, you should have a text field for full text searches, and an unanalyzed keyword field with doc_values enabled for aggregations, as follows:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "my_field": { 
          "type": "text",
          "fields": {
            "keyword": { 
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}

Use the my_field field for searches.

Use the my_field.keyword field for aggregations, sorting, or in scripts.

Enabling fielddata on text fields

You can enable fielddata on an existing text field using the PUT mapping API as follows:

PUT my_index/_mapping/_doc
{
  "properties": {
    "my_field": { 
      "type":     "text",
      "fielddata": true
    }
  }
}

The mapping that you specify for my_field should consist of the existing mapping for that field, plus the fielddata parameter.

fielddata_frequency_filter

Fielddata filtering can be used to reduce the number of terms loaded into memory, and thus reduce memory usage. Terms can be filtered by frequency:

The frequency filter allows you to only load terms whose document frequency falls between a min and max value, which can be expressed as an absolute number (when the number is bigger than 1.0) or as a percentage (eg 0.01 is 1% and 1.0 is 100%). Frequency is calculated per segment. Percentages are based on the number of docs which have a value for the field, as opposed to all docs in the segment.

Small segments can be excluded completely by specifying the minimum number of docs that the segment should contain with min_segment_size:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "tag": {
          "type": "text",
          "fielddata": true,
          "fielddata_frequency_filter": {
            "min": 0.001,
            "max": 0.1,
            "min_segment_size": 500
          }
        }
      }
    }
  }
}
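The min/max rule quoted above can be written out as a small function. This is a sketch under stated assumptions: the function names are hypothetical, the bounds are treated as inclusive, no rounding is applied, and min_segment_size is deliberately not modeled. Only the "absolute above 1.0, otherwise a fraction of the documents that have a value" rule comes from the documentation.

```python
def frequency_bounds(min_value, max_value, docs_with_field):
    """Turn min/max settings into absolute document-frequency bounds for one segment."""

    def resolve(value):
        # Values above 1.0 are absolute counts; otherwise a fraction of the
        # documents in the segment that have a value for the field.
        return value if value > 1.0 else value * docs_with_field

    return resolve(min_value), resolve(max_value)


def filter_terms(doc_freqs, min_value, max_value, docs_with_field):
    """Return the terms whose document frequency falls inside the bounds (assumed inclusive)."""
    low, high = frequency_bounds(min_value, max_value, docs_with_field)
    return {term for term, freq in doc_freqs.items() if low <= freq <= high}


# Hypothetical per-segment document frequencies for a "tag" field.
doc_freqs = {"typo": 3, "laptop": 400, "the": 9500}

# With 10,000 documents holding a value, 0.001 and 0.1 become roughly 10 and 1000.
loaded = filter_terms(doc_freqs, 0.001, 0.1, docs_with_field=10_000)
print(loaded)
```

With these made-up numbers, the very rare term and the very common term are both dropped, and only the mid-frequency term would be loaded into memory.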

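In short: fielddata is disabled on text fields by default, and Elasticsearch's automatic mapping adds a keyword sub-field for aggregations and similar operations. Why is it disabled on text fields? There are two reasons. First, fielddata consumes a lot of heap space when loading high-cardinality text fields. Second, aggregating on a text field usually makes no sense. As the documentation puts it: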

 

A text field is analyzed before indexing so that a value like New York can be found by searching for new or for york. A terms aggregation on this field will return a new bucket and a york bucket, when you probably want a single bucket called New York.

Instead, you should have a text field for full text searches, and an unanalyzed keyword field with doc_values enabled for aggregations, as follows:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "my_field": { 
          "type": "text",
          "fields": {
            "keyword": { 
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}

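Say you want to aggregate on "New York". Because the value was analyzed at index time, the aggregation returns two buckets, new and york. To get a single bucket, use an unanalyzed keyword field with doc_values enabled for the aggregation, and keep the text field for full-text search. That gives you the best of both.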
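The bucket difference is easy to reproduce with a toy terms aggregation. As before, this is a sketch and not Elasticsearch itself: `analyze` is a rough stand-in for the standard analyzer and the sample `values` are invented.

```python
import re
from collections import Counter


def analyze(text):
    """Rough stand-in for the standard analyzer: lowercase, split on non-word characters."""
    return [token for token in re.split(r"\W+", text.lower()) if token]


# Hypothetical field values, one per document.
values = ["New York", "New York", "York"]

# Aggregating over analyzed tokens: each document counts once per distinct token.
token_buckets = Counter(token for value in values for token in set(analyze(value)))

# Aggregating over the unanalyzed keyword value: one bucket per whole string.
keyword_buckets = Counter(values)

print(dict(token_buckets))
print(dict(keyword_buckets))
```

The token-based counts split "New York" into new and york, while the keyword-based counts keep it as a single bucket, which is usually what a report or a facet actually wants.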

 

Next, I ran the same aggregation on the keyword sub-field, and it succeeded:

GET /goods/_search
{
  "size": 0,
  "aggs": {
    "NAME": {
      "terms": {
        "field": "specs.内存.keyword",
        "size": 10
      }
    }
  }
}

The response:

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 182,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "NAME": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "4GB",
          "doc_count": 75
        },
        {
          "key": "3GB",
          "doc_count": 49
        },
        {
          "key": "6GB",
          "doc_count": 48
        },
        {
          "key": "2GB",
          "doc_count": 23
        },
        {
          "key": "8GB",
          "doc_count": 2
        }
      ]
    }
  }
}
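To avoid hitting the fielddata error again, the field name for an aggregation can be chosen by inspecting the mapping first. The helper below is a hypothetical sketch that works on a mapping already loaded as a Python dictionary. It assumes field names contain no dots, and it assumes specs.内存 was mapped the same way as the CPU品牌 field shown at the top of this post.

```python
def aggregatable_field(properties, path):
    """Return a field name usable in a terms aggregation, or None if there is none."""
    node = {"properties": properties}
    for part in path.split("."):
        node = node.get("properties", {}).get(part)
        if node is None:
            return None  # unknown field
    if "type" not in node:
        return None  # object field: aggregate on one of its leaf fields instead
    if node["type"] != "text" or node.get("fielddata"):
        return path
    # A text field without fielddata: look for a keyword sub-field.
    for name, sub_field in node.get("fields", {}).items():
        if sub_field.get("type") == "keyword":
            return f"{path}.{name}"
    return None


# Assumed shape of the relevant part of the goods mapping (illustrative only).
mapping = {
    "specs": {
        "properties": {
            "内存": {
                "type": "text",
                "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
            }
        }
    },
    "price": {"type": "long"},
}

print(aggregatable_field(mapping, "specs.内存"))
print(aggregatable_field(mapping, "price"))
```

One caveat with the automatically generated sub-field: ignore_above is 256, so values longer than 256 characters are not indexed in the keyword sub-field and will not appear in its buckets.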

Finally, thanks to Google and Baidu for helping solve the problem.